We are going to study on the factors that would affect the accommodation price on Airbnb in Brussels, and make a prediction on the total cost for 2 people staying 4 nights using proper model.
First of all, we performed exploratory data analysis to get an overall understanding of the data and examine the data in more detail. Next, we did some mappings to show our key insights. Then we construct the models, analyse the results and wrap it up in our conclusion.
Before we start, we need to import the data and conduct some basic exploratory data analysis.
#Import data
listings_csv <- vroom::vroom("http://data.insideairbnb.com/belgium/bru/brussels/2020-06-15/data/listings.csv.gz")
#save as another dataset
listings <- listings_csv
#take a glimpse of all the data
glimpse(listings)## Rows: 8,986
## Columns: 106
## $ id <dbl> 2352, 2354, 45145, 481...
## $ listing_url <chr> "https://www.airbnb.co...
## $ scrape_id <dbl> 2.02e+13, 2.02e+13, 2....
## $ last_scraped <date> 2020-06-16, 2020-06-1...
## $ name <chr> "Triplex-2chmbrs,grand...
## $ summary <chr> "Cute 2 bedrooms appar...
## $ space <chr> "ENGLISH VERSION: The ...
## $ description <chr> "Cute 2 bedrooms appar...
## $ experiences_offered <chr> "none", "none", "none"...
## $ neighborhood_overview <chr> "Basilique Koekelberg,...
## $ notes <chr> "Les événements et les...
## $ transit <chr> "bus 49, tram 82 direc...
## $ access <chr> "easy from the ring by...
## $ interaction <chr> NA, NA, NA, NA, "If yo...
## $ house_rules <chr> "CHECK-IN. Regular che...
## $ thumbnail_url <lgl> NA, NA, NA, NA, NA, NA...
## $ medium_url <lgl> NA, NA, NA, NA, NA, NA...
## $ picture_url <chr> "https://a0.muscache.c...
## $ xl_picture_url <lgl> NA, NA, NA, NA, NA, NA...
## $ host_id <dbl> 2582, 2582, 199370, 21...
## $ host_url <chr> "https://www.airbnb.co...
## $ host_name <chr> "Oda", "Oda", "Erick",...
## $ host_since <date> 2008-08-28, 2008-08-2...
## $ host_location <chr> "Belgium", "Belgium", ...
## $ host_about <chr> "Hi there! I've been a...
## $ host_response_time <chr> "within an hour", "wit...
## $ host_response_rate <chr> "100%", "100%", "N/A",...
## $ host_acceptance_rate <chr> "N/A", "N/A", "N/A", "...
## $ host_is_superhost <lgl> FALSE, FALSE, FALSE, F...
## $ host_thumbnail_url <chr> "https://a0.muscache.c...
## $ host_picture_url <chr> "https://a0.muscache.c...
## $ host_neighbourhood <chr> "Molenbeek-Saint-Jean"...
## $ host_listings_count <dbl> 3, 3, 2, 1, 1, 1, 16, ...
## $ host_total_listings_count <dbl> 3, 3, 2, 1, 1, 1, 16, ...
## $ host_verifications <chr> "['email', 'phone', 'r...
## $ host_has_profile_pic <lgl> TRUE, TRUE, TRUE, TRUE...
## $ host_identity_verified <lgl> FALSE, FALSE, TRUE, FA...
## $ street <chr> "Sint-Jans-Molenbeek, ...
## $ neighbourhood <chr> "Molenbeek-Saint-Jean"...
## $ neighbourhood_cleansed <chr> "Molenbeek-Saint-Jean"...
## $ neighbourhood_group_cleansed <lgl> NA, NA, NA, NA, NA, NA...
## $ city <chr> "Sint-Jans-Molenbeek",...
## $ state <chr> "Brussels", "Region Br...
## $ zipcode <dbl> 1080, 1080, 1210, 1150...
## $ market <chr> "Brussels", "Brussels"...
## $ smart_location <chr> "Sint-Jans-Molenbeek, ...
## $ country_code <chr> "BE", "BE", "BE", "BE"...
## $ country <chr> "Belgium", "Belgium", ...
## $ latitude <dbl> 50.9, 50.9, 50.9, 50.8...
## $ longitude <dbl> 4.31, 4.31, 4.37, 4.41...
## $ is_location_exact <lgl> TRUE, TRUE, FALSE, TRU...
## $ property_type <chr> "Apartment", "Apartmen...
## $ room_type <chr> "Entire home/apt", "En...
## $ accommodates <dbl> 5, 4, 2, 2, 3, 2, 3, 3...
## $ bathrooms <dbl> 1.0, 1.0, 1.0, 1.5, 1....
## $ bedrooms <dbl> 2, 1, 1, 2, 1, 1, 0, 0...
## $ beds <dbl> 2, 1, 1, 2, 1, 1, 2, 2...
## $ bed_type <chr> "Real Bed", "Real Bed"...
## $ amenities <chr> "{TV,\"Cable TV\",Inte...
## $ square_feet <lgl> NA, FALSE, NA, NA, NA,...
## $ price <chr> "$93.00", "$78.00", "$...
## $ weekly_price <chr> "$625.00", "$546.00", ...
## $ monthly_price <chr> "$1,800.00", "$1,495.0...
## $ security_deposit <chr> "$150.00", "$150.00", ...
## $ cleaning_fee <chr> "$65.00", "$65.00", "$...
## $ guests_included <dbl> 4, 1, 1, 1, 2, 2, 2, 2...
## $ extra_people <chr> "$5.00", "$0.00", "$30...
## $ minimum_nights <dbl> 2, 2, 1, 2, 5, 2, 1, 1...
## $ maximum_nights <dbl> 365, 365, 730, 14, 120...
## $ minimum_minimum_nights <dbl> 2, 2, 2, 2, 5, 2, 1, 1...
## $ maximum_minimum_nights <dbl> 2, 2, 4, 2, 5, 2, 1, 1...
## $ minimum_maximum_nights <dbl> 1125, 1125, 730, 14, 1...
## $ maximum_maximum_nights <dbl> 1125, 1125, 730, 14, 1...
## $ minimum_nights_avg_ntm <dbl> 2, 2, 2, 2, 5, 2, 1, 1...
## $ maximum_nights_avg_ntm <dbl> 1125, 1125, 730, 14, 1...
## $ calendar_updated <chr> "2 months ago", "2 mon...
## $ has_availability <lgl> TRUE, TRUE, TRUE, TRUE...
## $ availability_30 <dbl> 29, 28, 29, 30, 26, 0,...
## $ availability_60 <dbl> 57, 58, 56, 60, 56, 0,...
## $ availability_90 <dbl> 87, 88, 64, 90, 86, 0,...
## $ availability_365 <dbl> 267, 363, 338, 365, 36...
## $ calendar_last_scraped <date> 2020-06-16, 2020-06-1...
## $ number_of_reviews <dbl> 15, 2, 3, 0, 105, 12, ...
## $ number_of_reviews_ltm <dbl> 0, 0, 0, 0, 11, 4, 0, ...
## $ first_review <date> 2014-04-19, 2016-04-2...
## $ last_review <date> 2018-12-31, 2018-10-2...
## $ review_scores_rating <dbl> 87, 80, 100, NA, 97, 9...
## $ review_scores_accuracy <dbl> 9, 10, 10, NA, 10, 10,...
## $ review_scores_cleanliness <dbl> 9, 10, 10, NA, 10, 10,...
## $ review_scores_checkin <dbl> 9, 10, 10, NA, 10, 10,...
## $ review_scores_communication <dbl> 9, 10, 8, NA, 10, 10, ...
## $ review_scores_location <dbl> 8, 10, 10, NA, 10, 10,...
## $ review_scores_value <dbl> 9, 10, 8, NA, 9, 10, 7...
## $ requires_license <lgl> FALSE, FALSE, FALSE, F...
## $ license <lgl> NA, NA, NA, NA, NA, NA...
## $ jurisdiction_names <lgl> NA, NA, NA, NA, NA, NA...
## $ instant_bookable <lgl> TRUE, TRUE, TRUE, FALS...
## $ is_business_travel_ready <lgl> FALSE, FALSE, FALSE, F...
## $ cancellation_policy <chr> "strict_14_with_grace_...
## $ require_guest_profile_picture <lgl> TRUE, TRUE, FALSE, FAL...
## $ require_guest_phone_verification <lgl> FALSE, FALSE, FALSE, F...
## $ calculated_host_listings_count <dbl> 2, 2, 2, 1, 1, 1, 16, ...
## $ calculated_host_listings_count_entire_homes <dbl> 2, 2, 0, 1, 1, 1, 16, ...
## $ calculated_host_listings_count_private_rooms <dbl> 0, 0, 1, 0, 0, 0, 0, 0...
## $ calculated_host_listings_count_shared_rooms <dbl> 0, 0, 0, 0, 0, 0, 0, 0...
## $ reviews_per_month <dbl> 0.20, 0.04, 0.03, NA, ...
| Name | listings |
| Number of rows | 8986 |
| Number of columns | 106 |
| _______________________ | |
| Column type frequency: | |
| character | 45 |
| Date | 5 |
| logical | 17 |
| numeric | 39 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| listing_url | 0 | 1.00 | 33 | 37 | 0 | 8986 | 0 |
| name | 12 | 1.00 | 1 | 217 | 0 | 8815 | 0 |
| summary | 409 | 0.95 | 1 | 1000 | 0 | 8056 | 0 |
| space | 3216 | 0.64 | 1 | 1000 | 0 | 5374 | 0 |
| description | 291 | 0.97 | 1 | 1000 | 0 | 8431 | 0 |
| experiences_offered | 0 | 1.00 | 4 | 4 | 0 | 1 | 0 |
| neighborhood_overview | 3416 | 0.62 | 1 | 1000 | 0 | 4758 | 0 |
| notes | 6163 | 0.31 | 1 | 1000 | 0 | 2291 | 0 |
| transit | 3219 | 0.64 | 1 | 1000 | 0 | 4950 | 0 |
| access | 4864 | 0.46 | 1 | 1000 | 0 | 3416 | 0 |
| interaction | 4336 | 0.52 | 1 | 1000 | 0 | 3797 | 0 |
| house_rules | 4852 | 0.46 | 1 | 1000 | 0 | 3323 | 0 |
| picture_url | 0 | 1.00 | 81 | 146 | 0 | 8796 | 0 |
| host_url | 0 | 1.00 | 38 | 43 | 0 | 6233 | 0 |
| host_name | 4 | 1.00 | 1 | 28 | 0 | 2888 | 0 |
| host_location | 30 | 1.00 | 2 | 70 | 0 | 616 | 0 |
| host_about | 4714 | 0.48 | 1 | 6003 | 0 | 2683 | 11 |
| host_response_time | 4 | 1.00 | 3 | 18 | 0 | 5 | 0 |
| host_response_rate | 4 | 1.00 | 2 | 4 | 0 | 44 | 0 |
| host_acceptance_rate | 4 | 1.00 | 2 | 4 | 0 | 99 | 0 |
| host_thumbnail_url | 4 | 1.00 | 55 | 106 | 0 | 6211 | 0 |
| host_picture_url | 4 | 1.00 | 57 | 109 | 0 | 6211 | 0 |
| host_neighbourhood | 1814 | 0.80 | 5 | 30 | 0 | 88 | 0 |
| host_verifications | 0 | 1.00 | 2 | 170 | 0 | 210 | 0 |
| street | 0 | 1.00 | 11 | 71 | 0 | 191 | 0 |
| neighbourhood | 0 | 1.00 | 5 | 21 | 0 | 49 | 0 |
| neighbourhood_cleansed | 0 | 1.00 | 5 | 21 | 0 | 19 | 0 |
| city | 71 | 0.99 | 3 | 22 | 0 | 65 | 0 |
| state | 104 | 0.99 | 4 | 30 | 0 | 42 | 0 |
| market | 12 | 1.00 | 5 | 21 | 0 | 6 | 0 |
| smart_location | 0 | 1.00 | 11 | 39 | 0 | 78 | 0 |
| country_code | 0 | 1.00 | 2 | 2 | 0 | 1 | 0 |
| country | 0 | 1.00 | 7 | 7 | 0 | 1 | 0 |
| property_type | 0 | 1.00 | 4 | 22 | 0 | 26 | 0 |
| room_type | 0 | 1.00 | 10 | 15 | 0 | 4 | 0 |
| bed_type | 0 | 1.00 | 5 | 13 | 0 | 5 | 0 |
| amenities | 0 | 1.00 | 2 | 1552 | 0 | 8421 | 0 |
| price | 0 | 1.00 | 5 | 9 | 0 | 253 | 0 |
| weekly_price | 7833 | 0.13 | 6 | 9 | 0 | 215 | 0 |
| monthly_price | 8053 | 0.10 | 7 | 9 | 0 | 221 | 0 |
| security_deposit | 3516 | 0.61 | 5 | 9 | 0 | 109 | 0 |
| cleaning_fee | 2561 | 0.72 | 5 | 7 | 0 | 83 | 0 |
| extra_people | 0 | 1.00 | 5 | 7 | 0 | 54 | 0 |
| calendar_updated | 0 | 1.00 | 5 | 13 | 0 | 77 | 0 |
| cancellation_policy | 0 | 1.00 | 8 | 27 | 0 | 5 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| last_scraped | 0 | 1.00 | 2020-06-15 | 2020-06-19 | 2020-06-16 | 4 |
| host_since | 4 | 1.00 | 2008-08-28 | 2020-06-14 | 2015-08-12 | 2577 |
| calendar_last_scraped | 0 | 1.00 | 2020-06-15 | 2020-06-19 | 2020-06-16 | 4 |
| first_review | 1609 | 0.82 | 2010-10-15 | 2020-06-14 | 2018-06-22 | 2040 |
| last_review | 1609 | 0.82 | 2013-03-06 | 2020-06-16 | 2020-01-04 | 1320 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| thumbnail_url | 8986 | 0 | NaN | : |
| medium_url | 8986 | 0 | NaN | : |
| xl_picture_url | 8986 | 0 | NaN | : |
| host_is_superhost | 4 | 1 | 0.18 | FAL: 7400, TRU: 1582 |
| host_has_profile_pic | 4 | 1 | 1.00 | TRU: 8958, FAL: 24 |
| host_identity_verified | 4 | 1 | 0.32 | FAL: 6114, TRU: 2868 |
| neighbourhood_group_cleansed | 8986 | 0 | NaN | : |
| is_location_exact | 0 | 1 | 0.83 | TRU: 7487, FAL: 1499 |
| square_feet | 8960 | 0 | 0.00 | FAL: 26 |
| has_availability | 0 | 1 | 1.00 | TRU: 8986 |
| requires_license | 0 | 1 | 0.00 | FAL: 8986 |
| license | 8986 | 0 | NaN | : |
| jurisdiction_names | 8986 | 0 | NaN | : |
| instant_bookable | 0 | 1 | 0.45 | FAL: 4954, TRU: 4032 |
| is_business_travel_ready | 0 | 1 | 0.00 | FAL: 8986 |
| require_guest_profile_picture | 0 | 1 | 0.00 | FAL: 8953, TRU: 33 |
| require_guest_phone_verification | 0 | 1 | 0.01 | FAL: 8891, TRU: 95 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1.00 | 2.45e+07 | 1.30e+07 | 2.35e+03 | 1.35e+07 | 2.50e+07 | 3.67e+07 | 4.38e+07 | <U+2585><U+2585><U+2585><U+2585><U+2587> |
| scrape_id | 0 | 1.00 | 2.02e+13 | 0.00e+00 | 2.02e+13 | 2.02e+13 | 2.02e+13 | 2.02e+13 | 2.02e+13 | <U+2581><U+2581><U+2587><U+2581><U+2581> |
| host_id | 0 | 1.00 | 8.47e+07 | 9.43e+07 | 2.58e+03 | 1.54e+07 | 4.08e+07 | 1.31e+08 | 3.50e+08 | <U+2587><U+2582><U+2581><U+2581><U+2581> |
| host_listings_count | 4 | 1.00 | 1.19e+01 | 6.47e+01 | 0.00e+00 | 1.00e+00 | 1.00e+00 | 3.00e+00 | 7.58e+02 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| host_total_listings_count | 4 | 1.00 | 1.19e+01 | 6.47e+01 | 0.00e+00 | 1.00e+00 | 1.00e+00 | 3.00e+00 | 7.58e+02 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| zipcode | 195 | 0.98 | 1.06e+03 | 7.04e+01 | 6.32e+02 | 1.00e+03 | 1.05e+03 | 1.07e+03 | 3.09e+03 | <U+2587><U+2582><U+2581><U+2581><U+2581> |
| latitude | 0 | 1.00 | 5.08e+01 | 2.00e-02 | 5.08e+01 | 5.08e+01 | 5.08e+01 | 5.09e+01 | 5.09e+01 | <U+2581><U+2582><U+2587><U+2583><U+2581> |
| longitude | 0 | 1.00 | 4.36e+00 | 3.00e-02 | 4.26e+00 | 4.35e+00 | 4.36e+00 | 4.38e+00 | 4.49e+00 | <U+2581><U+2586><U+2587><U+2582><U+2581> |
| accommodates | 0 | 1.00 | 3.01e+00 | 1.73e+00 | 1.00e+00 | 2.00e+00 | 2.00e+00 | 4.00e+00 | 1.60e+01 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| bathrooms | 4 | 1.00 | 1.18e+00 | 5.70e-01 | 0.00e+00 | 1.00e+00 | 1.00e+00 | 1.00e+00 | 3.00e+01 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| bedrooms | 30 | 1.00 | 1.22e+00 | 8.50e-01 | 0.00e+00 | 1.00e+00 | 1.00e+00 | 1.00e+00 | 3.00e+01 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| beds | 99 | 0.99 | 1.71e+00 | 1.29e+00 | 0.00e+00 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 3.10e+01 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| guests_included | 0 | 1.00 | 1.49e+00 | 1.02e+00 | 1.00e+00 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 1.60e+01 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| minimum_nights | 0 | 1.00 | 7.40e+00 | 2.92e+01 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 3.00e+00 | 1.00e+03 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| maximum_nights | 0 | 1.00 | 7.27e+02 | 5.34e+02 | 1.00e+00 | 3.50e+01 | 1.12e+03 | 1.12e+03 | 1.11e+04 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| minimum_minimum_nights | 0 | 1.00 | 7.30e+00 | 2.91e+01 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 3.00e+00 | 1.00e+03 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| maximum_minimum_nights | 0 | 1.00 | 7.66e+00 | 2.94e+01 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 4.00e+00 | 1.00e+03 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| minimum_maximum_nights | 0 | 1.00 | 8.04e+02 | 5.10e+02 | 1.00e+00 | 9.90e+01 | 1.12e+03 | 1.12e+03 | 1.11e+04 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| maximum_maximum_nights | 0 | 1.00 | 8.14e+02 | 5.05e+02 | 1.00e+00 | 1.20e+02 | 1.12e+03 | 1.12e+03 | 1.11e+04 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| minimum_nights_avg_ntm | 0 | 1.00 | 7.49e+00 | 2.92e+01 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 3.00e+00 | 1.00e+03 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| maximum_nights_avg_ntm | 0 | 1.00 | 8.09e+02 | 5.06e+02 | 1.00e+00 | 1.20e+02 | 1.12e+03 | 1.12e+03 | 1.11e+04 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| availability_30 | 0 | 1.00 | 1.21e+01 | 1.30e+01 | 0.00e+00 | 0.00e+00 | 5.00e+00 | 2.80e+01 | 3.00e+01 | <U+2587><U+2581><U+2581><U+2581><U+2585> |
| availability_60 | 0 | 1.00 | 2.61e+01 | 2.62e+01 | 0.00e+00 | 0.00e+00 | 1.60e+01 | 5.70e+01 | 6.00e+01 | <U+2587><U+2581><U+2581><U+2582><U+2586> |
| availability_90 | 0 | 1.00 | 4.07e+01 | 3.92e+01 | 0.00e+00 | 0.00e+00 | 3.60e+01 | 8.60e+01 | 9.00e+01 | <U+2587><U+2581><U+2581><U+2581><U+2586> |
| availability_365 | 0 | 1.00 | 1.32e+02 | 1.41e+02 | 0.00e+00 | 0.00e+00 | 8.70e+01 | 2.73e+02 | 3.65e+02 | <U+2587><U+2582><U+2582><U+2581><U+2583> |
| number_of_reviews | 0 | 1.00 | 3.00e+01 | 6.14e+01 | 0.00e+00 | 1.00e+00 | 7.00e+00 | 2.80e+01 | 7.66e+02 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| number_of_reviews_ltm | 0 | 1.00 | 9.29e+00 | 1.67e+01 | 0.00e+00 | 0.00e+00 | 2.00e+00 | 1.00e+01 | 1.43e+02 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| review_scores_rating | 1758 | 0.80 | 9.26e+01 | 8.85e+00 | 2.00e+01 | 9.00e+01 | 9.50e+01 | 9.90e+01 | 1.00e+02 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_accuracy | 1766 | 0.80 | 9.55e+00 | 8.70e-01 | 2.00e+00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_cleanliness | 1764 | 0.80 | 9.22e+00 | 1.09e+00 | 2.00e+00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | <U+2581><U+2581><U+2581><U+2582><U+2587> |
| review_scores_checkin | 1766 | 0.80 | 9.67e+00 | 7.60e-01 | 2.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | 1.00e+01 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_communication | 1764 | 0.80 | 9.64e+00 | 8.00e-01 | 2.00e+00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_location | 1768 | 0.80 | 9.51e+00 | 7.90e-01 | 2.00e+00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_value | 1767 | 0.80 | 9.25e+00 | 9.50e-01 | 2.00e+00 | 9.00e+00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| calculated_host_listings_count | 0 | 1.00 | 5.94e+00 | 1.44e+01 | 1.00e+00 | 1.00e+00 | 1.00e+00 | 3.00e+00 | 8.50e+01 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| calculated_host_listings_count_entire_homes | 0 | 1.00 | 4.64e+00 | 1.36e+01 | 0.00e+00 | 0.00e+00 | 1.00e+00 | 2.00e+00 | 8.50e+01 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| calculated_host_listings_count_private_rooms | 0 | 1.00 | 1.18e+00 | 4.76e+00 | 0.00e+00 | 0.00e+00 | 0.00e+00 | 1.00e+00 | 5.30e+01 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| calculated_host_listings_count_shared_rooms | 0 | 1.00 | 5.00e-02 | 7.30e-01 | 0.00e+00 | 0.00e+00 | 0.00e+00 | 0.00e+00 | 1.50e+01 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| reviews_per_month | 1609 | 0.82 | 1.31e+00 | 1.72e+00 | 1.00e-02 | 2.10e-01 | 6.00e-01 | 1.65e+00 | 1.33e+01 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
Overall, there are 106 variables and 8986 observations in the initial data. Among all of those, 39 variables are stored as numbers, including the IDs, host listings counts, review scores ratings,number of reviews, number of rooms and beds, accommodates, number of guests, longitude, latitude etc.
What’s more, we could get the categorical factors from the results above, such as the experiences_offered, host_location, host_is_superhost, host_has_profile_pic, host_identity_verified, neighborhood_cleansed, city, state, market, smart_location, country, is_location exact, property_type, room_type, bed_type, has_availability, requires_license, instant_bookable, is_business_travel_ready, require_guest_profile_picture, require_guest_phone_verification etc. Basically, these are the factors related to the host, the location and other requirements.
We notice that all the value for price is stored as characters rather than numbers. Therefore, we change the those values into numbers.
#Data wrangling
#change the price values from strings to numbers
listings <- listings %>%
mutate(price = parse_number(price),
weekly_price=parse_number(weekly_price),
monthly_price=parse_number(monthly_price),
security_deposit=parse_number(security_deposit),
cleaning_fee = parse_number(cleaning_fee),
extra_people=parse_number(extra_people))
skimr::skim(listings)| Name | listings |
| Number of rows | 8986 |
| Number of columns | 106 |
| _______________________ | |
| Column type frequency: | |
| character | 39 |
| Date | 5 |
| logical | 17 |
| numeric | 45 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| listing_url | 0 | 1.00 | 33 | 37 | 0 | 8986 | 0 |
| name | 12 | 1.00 | 1 | 217 | 0 | 8815 | 0 |
| summary | 409 | 0.95 | 1 | 1000 | 0 | 8056 | 0 |
| space | 3216 | 0.64 | 1 | 1000 | 0 | 5374 | 0 |
| description | 291 | 0.97 | 1 | 1000 | 0 | 8431 | 0 |
| experiences_offered | 0 | 1.00 | 4 | 4 | 0 | 1 | 0 |
| neighborhood_overview | 3416 | 0.62 | 1 | 1000 | 0 | 4758 | 0 |
| notes | 6163 | 0.31 | 1 | 1000 | 0 | 2291 | 0 |
| transit | 3219 | 0.64 | 1 | 1000 | 0 | 4950 | 0 |
| access | 4864 | 0.46 | 1 | 1000 | 0 | 3416 | 0 |
| interaction | 4336 | 0.52 | 1 | 1000 | 0 | 3797 | 0 |
| house_rules | 4852 | 0.46 | 1 | 1000 | 0 | 3323 | 0 |
| picture_url | 0 | 1.00 | 81 | 146 | 0 | 8796 | 0 |
| host_url | 0 | 1.00 | 38 | 43 | 0 | 6233 | 0 |
| host_name | 4 | 1.00 | 1 | 28 | 0 | 2888 | 0 |
| host_location | 30 | 1.00 | 2 | 70 | 0 | 616 | 0 |
| host_about | 4714 | 0.48 | 1 | 6003 | 0 | 2683 | 11 |
| host_response_time | 4 | 1.00 | 3 | 18 | 0 | 5 | 0 |
| host_response_rate | 4 | 1.00 | 2 | 4 | 0 | 44 | 0 |
| host_acceptance_rate | 4 | 1.00 | 2 | 4 | 0 | 99 | 0 |
| host_thumbnail_url | 4 | 1.00 | 55 | 106 | 0 | 6211 | 0 |
| host_picture_url | 4 | 1.00 | 57 | 109 | 0 | 6211 | 0 |
| host_neighbourhood | 1814 | 0.80 | 5 | 30 | 0 | 88 | 0 |
| host_verifications | 0 | 1.00 | 2 | 170 | 0 | 210 | 0 |
| street | 0 | 1.00 | 11 | 71 | 0 | 191 | 0 |
| neighbourhood | 0 | 1.00 | 5 | 21 | 0 | 49 | 0 |
| neighbourhood_cleansed | 0 | 1.00 | 5 | 21 | 0 | 19 | 0 |
| city | 71 | 0.99 | 3 | 22 | 0 | 65 | 0 |
| state | 104 | 0.99 | 4 | 30 | 0 | 42 | 0 |
| market | 12 | 1.00 | 5 | 21 | 0 | 6 | 0 |
| smart_location | 0 | 1.00 | 11 | 39 | 0 | 78 | 0 |
| country_code | 0 | 1.00 | 2 | 2 | 0 | 1 | 0 |
| country | 0 | 1.00 | 7 | 7 | 0 | 1 | 0 |
| property_type | 0 | 1.00 | 4 | 22 | 0 | 26 | 0 |
| room_type | 0 | 1.00 | 10 | 15 | 0 | 4 | 0 |
| bed_type | 0 | 1.00 | 5 | 13 | 0 | 5 | 0 |
| amenities | 0 | 1.00 | 2 | 1552 | 0 | 8421 | 0 |
| calendar_updated | 0 | 1.00 | 5 | 13 | 0 | 77 | 0 |
| cancellation_policy | 0 | 1.00 | 8 | 27 | 0 | 5 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| last_scraped | 0 | 1.00 | 2020-06-15 | 2020-06-19 | 2020-06-16 | 4 |
| host_since | 4 | 1.00 | 2008-08-28 | 2020-06-14 | 2015-08-12 | 2577 |
| calendar_last_scraped | 0 | 1.00 | 2020-06-15 | 2020-06-19 | 2020-06-16 | 4 |
| first_review | 1609 | 0.82 | 2010-10-15 | 2020-06-14 | 2018-06-22 | 2040 |
| last_review | 1609 | 0.82 | 2013-03-06 | 2020-06-16 | 2020-01-04 | 1320 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| thumbnail_url | 8986 | 0 | NaN | : |
| medium_url | 8986 | 0 | NaN | : |
| xl_picture_url | 8986 | 0 | NaN | : |
| host_is_superhost | 4 | 1 | 0.18 | FAL: 7400, TRU: 1582 |
| host_has_profile_pic | 4 | 1 | 1.00 | TRU: 8958, FAL: 24 |
| host_identity_verified | 4 | 1 | 0.32 | FAL: 6114, TRU: 2868 |
| neighbourhood_group_cleansed | 8986 | 0 | NaN | : |
| is_location_exact | 0 | 1 | 0.83 | TRU: 7487, FAL: 1499 |
| square_feet | 8960 | 0 | 0.00 | FAL: 26 |
| has_availability | 0 | 1 | 1.00 | TRU: 8986 |
| requires_license | 0 | 1 | 0.00 | FAL: 8986 |
| license | 8986 | 0 | NaN | : |
| jurisdiction_names | 8986 | 0 | NaN | : |
| instant_bookable | 0 | 1 | 0.45 | FAL: 4954, TRU: 4032 |
| is_business_travel_ready | 0 | 1 | 0.00 | FAL: 8986 |
| require_guest_profile_picture | 0 | 1 | 0.00 | FAL: 8953, TRU: 33 |
| require_guest_phone_verification | 0 | 1 | 0.01 | FAL: 8891, TRU: 95 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1.00 | 2.45e+07 | 1.30e+07 | 2.35e+03 | 1.35e+07 | 2.50e+07 | 3.67e+07 | 4.38e+07 | <U+2585><U+2585><U+2585><U+2585><U+2587> |
| scrape_id | 0 | 1.00 | 2.02e+13 | 0.00e+00 | 2.02e+13 | 2.02e+13 | 2.02e+13 | 2.02e+13 | 2.02e+13 | <U+2581><U+2581><U+2587><U+2581><U+2581> |
| host_id | 0 | 1.00 | 8.47e+07 | 9.43e+07 | 2.58e+03 | 1.54e+07 | 4.08e+07 | 1.31e+08 | 3.50e+08 | <U+2587><U+2582><U+2581><U+2581><U+2581> |
| host_listings_count | 4 | 1.00 | 1.19e+01 | 6.47e+01 | 0.00e+00 | 1.00e+00 | 1.00e+00 | 3.00e+00 | 7.58e+02 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| host_total_listings_count | 4 | 1.00 | 1.19e+01 | 6.47e+01 | 0.00e+00 | 1.00e+00 | 1.00e+00 | 3.00e+00 | 7.58e+02 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| zipcode | 195 | 0.98 | 1.06e+03 | 7.04e+01 | 6.32e+02 | 1.00e+03 | 1.05e+03 | 1.07e+03 | 3.09e+03 | <U+2587><U+2582><U+2581><U+2581><U+2581> |
| latitude | 0 | 1.00 | 5.08e+01 | 2.00e-02 | 5.08e+01 | 5.08e+01 | 5.08e+01 | 5.09e+01 | 5.09e+01 | <U+2581><U+2582><U+2587><U+2583><U+2581> |
| longitude | 0 | 1.00 | 4.36e+00 | 3.00e-02 | 4.26e+00 | 4.35e+00 | 4.36e+00 | 4.38e+00 | 4.49e+00 | <U+2581><U+2586><U+2587><U+2582><U+2581> |
| accommodates | 0 | 1.00 | 3.01e+00 | 1.73e+00 | 1.00e+00 | 2.00e+00 | 2.00e+00 | 4.00e+00 | 1.60e+01 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| bathrooms | 4 | 1.00 | 1.18e+00 | 5.70e-01 | 0.00e+00 | 1.00e+00 | 1.00e+00 | 1.00e+00 | 3.00e+01 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| bedrooms | 30 | 1.00 | 1.22e+00 | 8.50e-01 | 0.00e+00 | 1.00e+00 | 1.00e+00 | 1.00e+00 | 3.00e+01 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| beds | 99 | 0.99 | 1.71e+00 | 1.29e+00 | 0.00e+00 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 3.10e+01 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| price | 0 | 1.00 | 7.79e+01 | 1.53e+02 | 0.00e+00 | 4.00e+01 | 6.00e+01 | 8.50e+01 | 8.94e+03 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| weekly_price | 7833 | 0.13 | 4.18e+02 | 2.82e+02 | 7.00e+01 | 2.50e+02 | 3.50e+02 | 5.00e+02 | 3.00e+03 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| monthly_price | 8053 | 0.10 | 1.38e+03 | 1.03e+03 | 3.15e+02 | 7.50e+02 | 1.10e+03 | 1.65e+03 | 9.95e+03 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| security_deposit | 3516 | 0.61 | 2.15e+02 | 4.30e+02 | 0.00e+00 | 0.00e+00 | 1.00e+02 | 2.50e+02 | 4.74e+03 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| cleaning_fee | 2561 | 0.72 | 2.76e+01 | 2.66e+01 | 0.00e+00 | 1.00e+01 | 2.00e+01 | 4.00e+01 | 5.05e+02 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| guests_included | 0 | 1.00 | 1.49e+00 | 1.02e+00 | 1.00e+00 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 1.60e+01 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| extra_people | 0 | 1.00 | 7.35e+00 | 1.20e+01 | 0.00e+00 | 0.00e+00 | 0.00e+00 | 1.20e+01 | 2.69e+02 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| minimum_nights | 0 | 1.00 | 7.40e+00 | 2.92e+01 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 3.00e+00 | 1.00e+03 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| maximum_nights | 0 | 1.00 | 7.27e+02 | 5.34e+02 | 1.00e+00 | 3.50e+01 | 1.12e+03 | 1.12e+03 | 1.11e+04 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| minimum_minimum_nights | 0 | 1.00 | 7.30e+00 | 2.91e+01 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 3.00e+00 | 1.00e+03 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| maximum_minimum_nights | 0 | 1.00 | 7.66e+00 | 2.94e+01 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 4.00e+00 | 1.00e+03 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| minimum_maximum_nights | 0 | 1.00 | 8.04e+02 | 5.10e+02 | 1.00e+00 | 9.90e+01 | 1.12e+03 | 1.12e+03 | 1.11e+04 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| maximum_maximum_nights | 0 | 1.00 | 8.14e+02 | 5.05e+02 | 1.00e+00 | 1.20e+02 | 1.12e+03 | 1.12e+03 | 1.11e+04 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| minimum_nights_avg_ntm | 0 | 1.00 | 7.49e+00 | 2.92e+01 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 3.00e+00 | 1.00e+03 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| maximum_nights_avg_ntm | 0 | 1.00 | 8.09e+02 | 5.06e+02 | 1.00e+00 | 1.20e+02 | 1.12e+03 | 1.12e+03 | 1.11e+04 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| availability_30 | 0 | 1.00 | 1.21e+01 | 1.30e+01 | 0.00e+00 | 0.00e+00 | 5.00e+00 | 2.80e+01 | 3.00e+01 | <U+2587><U+2581><U+2581><U+2581><U+2585> |
| availability_60 | 0 | 1.00 | 2.61e+01 | 2.62e+01 | 0.00e+00 | 0.00e+00 | 1.60e+01 | 5.70e+01 | 6.00e+01 | <U+2587><U+2581><U+2581><U+2582><U+2586> |
| availability_90 | 0 | 1.00 | 4.07e+01 | 3.92e+01 | 0.00e+00 | 0.00e+00 | 3.60e+01 | 8.60e+01 | 9.00e+01 | <U+2587><U+2581><U+2581><U+2581><U+2586> |
| availability_365 | 0 | 1.00 | 1.32e+02 | 1.41e+02 | 0.00e+00 | 0.00e+00 | 8.70e+01 | 2.73e+02 | 3.65e+02 | <U+2587><U+2582><U+2582><U+2581><U+2583> |
| number_of_reviews | 0 | 1.00 | 3.00e+01 | 6.14e+01 | 0.00e+00 | 1.00e+00 | 7.00e+00 | 2.80e+01 | 7.66e+02 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| number_of_reviews_ltm | 0 | 1.00 | 9.29e+00 | 1.67e+01 | 0.00e+00 | 0.00e+00 | 2.00e+00 | 1.00e+01 | 1.43e+02 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| review_scores_rating | 1758 | 0.80 | 9.26e+01 | 8.85e+00 | 2.00e+01 | 9.00e+01 | 9.50e+01 | 9.90e+01 | 1.00e+02 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_accuracy | 1766 | 0.80 | 9.55e+00 | 8.70e-01 | 2.00e+00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_cleanliness | 1764 | 0.80 | 9.22e+00 | 1.09e+00 | 2.00e+00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | <U+2581><U+2581><U+2581><U+2582><U+2587> |
| review_scores_checkin | 1766 | 0.80 | 9.67e+00 | 7.60e-01 | 2.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | 1.00e+01 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_communication | 1764 | 0.80 | 9.64e+00 | 8.00e-01 | 2.00e+00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_location | 1768 | 0.80 | 9.51e+00 | 7.90e-01 | 2.00e+00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_value | 1767 | 0.80 | 9.25e+00 | 9.50e-01 | 2.00e+00 | 9.00e+00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| calculated_host_listings_count | 0 | 1.00 | 5.94e+00 | 1.44e+01 | 1.00e+00 | 1.00e+00 | 1.00e+00 | 3.00e+00 | 8.50e+01 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| calculated_host_listings_count_entire_homes | 0 | 1.00 | 4.64e+00 | 1.36e+01 | 0.00e+00 | 0.00e+00 | 1.00e+00 | 2.00e+00 | 8.50e+01 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| calculated_host_listings_count_private_rooms | 0 | 1.00 | 1.18e+00 | 4.76e+00 | 0.00e+00 | 0.00e+00 | 0.00e+00 | 1.00e+00 | 5.30e+01 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| calculated_host_listings_count_shared_rooms | 0 | 1.00 | 5.00e-02 | 7.30e-01 | 0.00e+00 | 0.00e+00 | 0.00e+00 | 0.00e+00 | 1.50e+01 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| reviews_per_month | 1609 | 0.82 | 1.31e+00 | 1.72e+00 | 1.00e-02 | 2.10e-01 | 6.00e-01 | 1.65e+00 | 1.33e+01 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
We select the variables we believe are the most relevant to our analysis from the perspective of hosts, location, property and reviews, and then filter out those could only accommodate 1 because we have 2 persons.
listings <- listings%>%
# we select the variables we believe are the most relevant to our analysis.
select(#id
listing_url,
#hosts
host_id,
host_since,
host_response_rate,
host_response_time,
host_acceptance_rate,
host_is_superhost,
host_has_profile_pic,
host_identity_verified,
#location
neighbourhood_cleansed,
latitude,
longitude,
is_location_exact,
neighbourhood,
#property
property_type,
room_type,
accommodates,
bathrooms,
bedrooms,
beds,
price,
security_deposit,
cleaning_fee,
guests_included,
extra_people,
minimum_nights,
maximum_nights,
#reviews
number_of_reviews,
review_scores_rating,
review_scores_accuracy,
review_scores_checkin,
review_scores_cleanliness,
review_scores_communication,
review_scores_location,
review_scores_value,
cancellation_policy)%>%
#filter for those accommodate more than 1
filter(accommodates>1)
skim(listings)| Name | listings |
| Number of rows | 8203 |
| Number of columns | 36 |
| _______________________ | |
| Column type frequency: | |
| character | 9 |
| Date | 1 |
| logical | 4 |
| numeric | 22 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| listing_url | 0 | 1 | 33 | 37 | 0 | 8203 | 0 |
| host_response_rate | 4 | 1 | 2 | 4 | 0 | 44 | 0 |
| host_response_time | 4 | 1 | 3 | 18 | 0 | 5 | 0 |
| host_acceptance_rate | 4 | 1 | 2 | 4 | 0 | 98 | 0 |
| neighbourhood_cleansed | 0 | 1 | 5 | 21 | 0 | 19 | 0 |
| neighbourhood | 0 | 1 | 5 | 21 | 0 | 49 | 0 |
| property_type | 0 | 1 | 4 | 22 | 0 | 22 | 0 |
| room_type | 0 | 1 | 10 | 15 | 0 | 4 | 0 |
| cancellation_policy | 0 | 1 | 8 | 27 | 0 | 5 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| host_since | 4 | 1 | 2008-08-28 | 2020-06-14 | 2015-08-13 | 2521 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| host_is_superhost | 4 | 1 | 0.18 | FAL: 6754, TRU: 1445 |
| host_has_profile_pic | 4 | 1 | 1.00 | TRU: 8179, FAL: 20 |
| host_identity_verified | 4 | 1 | 0.32 | FAL: 5584, TRU: 2615 |
| is_location_exact | 0 | 1 | 0.84 | TRU: 6863, FAL: 1340 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| host_id | 0 | 1.00 | 8.44e+07 | 9.42e+07 | 2582.00 | 1.53e+07 | 4.11e+07 | 1.30e+08 | 3.50e+08 | <U+2587><U+2582><U+2581><U+2581><U+2581> |
| latitude | 0 | 1.00 | 5.08e+01 | 2.00e-02 | 50.77 | 5.08e+01 | 5.08e+01 | 5.09e+01 | 5.09e+01 | <U+2581><U+2582><U+2587><U+2583><U+2581> |
| longitude | 0 | 1.00 | 4.36e+00 | 3.00e-02 | 4.26 | 4.35e+00 | 4.36e+00 | 4.38e+00 | 4.49e+00 | <U+2581><U+2586><U+2587><U+2582><U+2581> |
| accommodates | 0 | 1.00 | 3.21e+00 | 1.69e+00 | 2.00 | 2.00e+00 | 2.00e+00 | 4.00e+00 | 1.60e+01 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| bathrooms | 2 | 1.00 | 1.18e+00 | 5.70e-01 | 0.00 | 1.00e+00 | 1.00e+00 | 1.00e+00 | 3.00e+01 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| bedrooms | 25 | 1.00 | 1.24e+00 | 8.80e-01 | 0.00 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 3.00e+01 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| beds | 70 | 0.99 | 1.77e+00 | 1.32e+00 | 0.00 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 3.10e+01 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| price | 0 | 1.00 | 8.02e+01 | 1.51e+02 | 0.00 | 4.50e+01 | 6.00e+01 | 8.90e+01 | 8.94e+03 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| security_deposit | 3102 | 0.62 | 2.17e+02 | 4.36e+02 | 0.00 | 0.00e+00 | 1.00e+02 | 2.50e+02 | 4.74e+03 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| cleaning_fee | 2211 | 0.73 | 2.82e+01 | 2.69e+01 | 0.00 | 1.00e+01 | 2.00e+01 | 4.00e+01 | 5.05e+02 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| guests_included | 0 | 1.00 | 1.53e+00 | 1.05e+00 | 1.00 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 1.60e+01 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| extra_people | 0 | 1.00 | 7.57e+00 | 1.18e+01 | 0.00 | 0.00e+00 | 0.00e+00 | 1.50e+01 | 2.69e+02 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| minimum_nights | 0 | 1.00 | 6.74e+00 | 2.90e+01 | 1.00 | 1.00e+00 | 2.00e+00 | 3.00e+00 | 1.00e+03 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| maximum_nights | 0 | 1.00 | 7.36e+02 | 5.34e+02 | 1.00 | 3.55e+01 | 1.12e+03 | 1.12e+03 | 1.11e+04 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| number_of_reviews | 0 | 1.00 | 3.12e+01 | 6.27e+01 | 0.00 | 1.00e+00 | 8.00e+00 | 2.90e+01 | 7.66e+02 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| review_scores_rating | 1507 | 0.82 | 9.27e+01 | 8.67e+00 | 20.00 | 9.00e+01 | 9.50e+01 | 9.90e+01 | 1.00e+02 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_accuracy | 1515 | 0.82 | 9.55e+00 | 8.60e-01 | 2.00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_checkin | 1515 | 0.82 | 9.67e+00 | 7.50e-01 | 2.00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | 1.00e+01 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_cleanliness | 1513 | 0.82 | 9.23e+00 | 1.07e+00 | 2.00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | <U+2581><U+2581><U+2581><U+2582><U+2587> |
| review_scores_communication | 1513 | 0.82 | 9.65e+00 | 7.80e-01 | 2.00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_location | 1517 | 0.82 | 9.51e+00 | 7.80e-01 | 2.00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_value | 1516 | 0.82 | 9.26e+00 | 9.30e-01 | 2.00 | 9.00e+00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
As we can see from the results above, there are 2211 missing values in cleaning_fee and 3102 missing values in Security_deposit, which indicates that the cleaning fee and the security deposit are not applicable in those cases. Therefore, we could change the NAs into 0 as follows.
#handling NAs in cleaning fees and security deposit
listings <- listings %>%
mutate(cleaning_fee = case_when(
is.na(cleaning_fee) ~ 0,
TRUE ~ cleaning_fee),
security_deposit = case_when(
is.na(security_deposit) ~ 0,
TRUE ~ cleaning_fee))Also, we found that there are some observations with missing values in bedroom and bathroom. It could be something like a studio with no extra bedroom or bathroom in the property. Therefore, we change those NAs into 0 as well.
#handling NAs in bedrooms and bathrooms
listings <- listings %>%
mutate(bedrooms = case_when(
is.na(bedrooms) ~ 0,
TRUE ~ bedrooms),
bathrooms = case_when(
is.na(bathrooms) ~ 0,
TRUE ~ bathrooms))As for those missing values in host_since and the reviews, we just filter those things out.
#filter out those with NAs in host_since and the review data
listings <- listings %>%
drop_na(host_since,
review_scores_rating,
review_scores_accuracy,
review_scores_checkin,
review_scores_cleanliness,
review_scores_communication,
review_scores_location,
review_scores_value)After that, we also simplified the property types into 5 categories. 4892 of the properties are apartments, which is 73.22% of the total listings in Brussels. There are 529 houses and 314 condominiums on Airbnb as well. In this case, we keep the top 4 types of properties—apartment,house,condominium,townhouse–and classify the other types by a new category called “Others” to construct the simplified property type as follows.
#find the top 4 property types
top_4_p<- listings%>%
count(property_type)%>%
arrange(desc(n))%>%
mutate(proportion=n/sum(n))
top_4_p| property_type | n | proportion |
|---|---|---|
| Apartment | 4892 | 0.732 |
| House | 529 | 0.0792 |
| Condominium | 314 | 0.047 |
| Townhouse | 265 | 0.0397 |
| Loft | 251 | 0.0376 |
| Serviced apartment | 197 | 0.0295 |
| Bed and breakfast | 90 | 0.0135 |
| Guesthouse | 40 | 0.00599 |
| Guest suite | 30 | 0.00449 |
| Hotel | 26 | 0.00389 |
| Boutique hotel | 15 | 0.00225 |
| Villa | 9 | 0.00135 |
| Aparthotel | 7 | 0.00105 |
| Other | 7 | 0.00105 |
| Cottage | 2 | 0.000299 |
| Nature lodge | 2 | 0.000299 |
| Casa particular (Cuba) | 1 | 0.00015 |
| Earth house | 1 | 0.00015 |
| Farm stay | 1 | 0.00015 |
| Hostel | 1 | 0.00015 |
| Tiny house | 1 | 0.00015 |
#apply simplified property types into 5 categories
listings <- listings %>%
mutate(prop_type_simplified = case_when(
property_type %in% c("Apartment","House", "Condominium","Townhouse") ~ property_type,
TRUE ~ "Other"))
category_5_p<-listings%>%
count(prop_type_simplified)%>%
mutate(prop_type_simplified=factor(prop_type_simplified,order=TRUE,levels = c("Apartment","House", "Condominium","Townhouse","Other")),
proportion=n/sum(n))
ggplot(category_5_p,aes(x=prop_type_simplified,y=n))+
geom_col(fill="Pink")+
labs(title="Simplified property type distribution",
subtitle="Simplified property types on Airbnb in Brussels",
x="Simplified Property Type",
y="Numbers")+
theme_bw()+
NULLAs for the minimum nights, Airbnb is most commonly used for travel purposes, and we only want to include listings in our regression analysis that are intended for travel purposes. Therefore, we took a look at the most common values for minimum nights in Brussels.
#find the most common values for minimum_night
min_n <- listings%>%
count(minimum_nights)%>%
arrange(desc(n))%>%
mutate(proportion=n/sum(n))
top10_min_n<-min_n%>%
head(10)
#The most common values for minimum_night
ggplot(top10_min_n,aes(x=reorder(minimum_nights,-n),y=n))+
geom_col(fill="pink")+
labs(title="Most common minimum-nights requirements",
subtitle="Top 10 minimum-nights required to book a property on Airbnb in Brussels",
x="Value of minimum nights",
y="Numbers")+
theme_bw()+
NULL#overall density
ggplot(min_n,aes(x=minimum_nights))+
geom_density(fill="pink",color="red")+
labs(title="Minimum nights distribution",
subtitle="minimum nights required to book a property on Airbnb in Brussels",
x="Minimum nights",
y="Density")+
theme_bw()+
NULLThe most common values for the variable minimum_nights are 1 and 2 days, the majority of them are within 7 days. Among the top 10 most common values, we could see that the minimum required nights of 30 days and 90 days are somehow more common than that of 6 days, which is much longer than what we expected for travel purposes. This probably suggests that Airbnb plays the same role as the housing agents that provide short-rent houses in Brussels. The overall distribution is right skewed as well, which shows the same conclusion.
Since we are going to estimate the cost of 2 people staying for 4 nights, we filter the Airbnb data so that it only includes observations with a minimum nights less than 4 and get rid of the outliers. Then we could get the data that we are going to use in the regression model.
| Name | listings |
| Number of rows | 5670 |
| Number of columns | 37 |
| _______________________ | |
| Column type frequency: | |
| character | 10 |
| Date | 1 |
| logical | 4 |
| numeric | 22 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| listing_url | 0 | 1 | 33 | 37 | 0 | 5670 | 0 |
| host_response_rate | 0 | 1 | 2 | 4 | 0 | 41 | 0 |
| host_response_time | 0 | 1 | 3 | 18 | 0 | 5 | 0 |
| host_acceptance_rate | 0 | 1 | 2 | 4 | 0 | 96 | 0 |
| neighbourhood_cleansed | 0 | 1 | 5 | 21 | 0 | 19 | 0 |
| neighbourhood | 0 | 1 | 5 | 21 | 0 | 49 | 0 |
| property_type | 0 | 1 | 4 | 18 | 0 | 20 | 0 |
| room_type | 0 | 1 | 10 | 15 | 0 | 4 | 0 |
| cancellation_policy | 0 | 1 | 8 | 27 | 0 | 4 | 0 |
| prop_type_simplified | 0 | 1 | 5 | 11 | 0 | 5 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| host_since | 0 | 1 | 2008-08-28 | 2020-05-25 | 2015-08-29 | 2174 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| host_is_superhost | 0 | 1 | 0.21 | FAL: 4488, TRU: 1182 |
| host_has_profile_pic | 0 | 1 | 1.00 | TRU: 5658, FAL: 12 |
| host_identity_verified | 0 | 1 | 0.31 | FAL: 3933, TRU: 1737 |
| is_location_exact | 0 | 1 | 0.85 | TRU: 4811, FAL: 859 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| host_id | 0 | 1 | 8.54e+07 | 9.28e+07 | 2582.00 | 1.58e+07 | 4.27e+07 | 1.33e+08 | 3.48e+08 | <U+2587><U+2582><U+2581><U+2581><U+2581> |
| latitude | 0 | 1 | 5.08e+01 | 2.00e-02 | 50.77 | 5.08e+01 | 5.08e+01 | 5.09e+01 | 5.09e+01 | <U+2581><U+2582><U+2587><U+2583><U+2581> |
| longitude | 0 | 1 | 4.36e+00 | 3.00e-02 | 4.26 | 4.34e+00 | 4.36e+00 | 4.38e+00 | 4.47e+00 | <U+2581><U+2583><U+2587><U+2582><U+2581> |
| accommodates | 0 | 1 | 3.25e+00 | 1.76e+00 | 2.00 | 2.00e+00 | 2.00e+00 | 4.00e+00 | 1.60e+01 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| bathrooms | 0 | 1 | 1.17e+00 | 4.60e-01 | 0.00 | 1.00e+00 | 1.00e+00 | 1.00e+00 | 1.10e+01 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| bedrooms | 0 | 1 | 1.22e+00 | 8.10e-01 | 0.00 | 1.00e+00 | 1.00e+00 | 1.00e+00 | 1.00e+01 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| beds | 25 | 1 | 1.80e+00 | 1.33e+00 | 0.00 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 1.60e+01 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| price | 0 | 1 | 7.69e+01 | 7.89e+01 | 0.00 | 4.50e+01 | 6.00e+01 | 8.90e+01 | 3.50e+03 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| security_deposit | 0 | 1 | 1.77e+01 | 2.42e+01 | 0.00 | 0.00e+00 | 1.00e+01 | 3.00e+01 | 5.05e+02 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| cleaning_fee | 0 | 1 | 2.08e+01 | 2.40e+01 | 0.00 | 0.00e+00 | 1.50e+01 | 3.00e+01 | 5.05e+02 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| guests_included | 0 | 1 | 1.61e+00 | 1.10e+00 | 1.00 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 1.60e+01 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| extra_people | 0 | 1 | 8.53e+00 | 1.20e+01 | 0.00 | 0.00e+00 | 5.00e+00 | 1.50e+01 | 2.50e+02 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| minimum_nights | 0 | 1 | 1.90e+00 | 8.80e-01 | 1.00 | 1.00e+00 | 2.00e+00 | 2.00e+00 | 4.00e+00 | <U+2587><U+2587><U+2581><U+2583><U+2581> |
| maximum_nights | 0 | 1 | 7.27e+02 | 5.34e+02 | 1.00 | 3.10e+01 | 1.12e+03 | 1.12e+03 | 1.11e+04 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| number_of_reviews | 0 | 1 | 4.20e+01 | 7.13e+01 | 1.00 | 5.00e+00 | 1.40e+01 | 4.50e+01 | 7.66e+02 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| review_scores_rating | 0 | 1 | 9.27e+01 | 8.45e+00 | 20.00 | 9.00e+01 | 9.50e+01 | 9.80e+01 | 1.00e+02 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_accuracy | 0 | 1 | 9.55e+00 | 8.40e-01 | 2.00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_checkin | 0 | 1 | 9.68e+00 | 7.30e-01 | 2.00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | 1.00e+01 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_cleanliness | 0 | 1 | 9.23e+00 | 1.06e+00 | 2.00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | <U+2581><U+2581><U+2581><U+2582><U+2587> |
| review_scores_communication | 0 | 1 | 9.66e+00 | 7.50e-01 | 2.00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_location | 0 | 1 | 9.52e+00 | 7.80e-01 | 2.00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_value | 0 | 1 | 9.26e+00 | 9.00e-01 | 2.00 | 9.00e+00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
And we are going to explore the data from the perspectives of host, location, property, reviews and cancellation policy separately.
First of all, we would like to explore how long have the hosts been on Airbnb.
#we could get the latest date of all the scrape in checking the listings_csv file and found it to be "2020-06-19"
#caculate the year of host experience
year_host_experience<- listings%>%
mutate(year_host=as.numeric(ymd("2020-06-19")-host_since)/365)
#statistical summary of host experience
year_host<-favstats(~year_host, data=year_host_experience)
year_host%>%
kbl() %>%
kable_styling()| min | Q1 | median | Q3 | max | mean | sd | n | missing | |
|---|---|---|---|---|---|---|---|---|---|
| 0.068 | 3.04 | 4.81 | 6.06 | 11.8 | 4.59 | 2.19 | 5670 | 0 |
#plot the distribution of host experience
ggplot(year_host_experience,aes(x=year_host))+
geom_density(fill="pink",color="red")+
labs(title="The majority of the hosts had more than 3 years of experience on Airbnb",
subtitle="in Brussels",
x="Years of host experience",
y="Density")+
theme_bw()+
NULLOverall, the hosts had an average of 4.6 years of experience in hosting on Airbnb. It is slightly left-skewed with a median of 4.81. The majority of hosts had an experience over 3 years.
Then we look at the response time of the hosts. Although this variable is not applicable to 49.9% of the Airbnb hosts, 28.57% of them will response within one hour. And only less than 4% of the hosts have a response time longer than one day, which suggests that the quality of Airbnb hosts in Brussels are high.
#distribution of host response time
listings<-listings%>%
mutate(host_response_time=factor(host_response_time,order=TRUE,levels=c("within an hour", "within a few hours", "within a day","a few days or more")))
#summary of the response time
res_time <-listings%>%
group_by(host_response_time)%>%
summarise(num=n())%>%
mutate(proportion=num/sum(num))
res_time%>%
kbl() %>%
kable_styling()| host_response_time | num | proportion |
|---|---|---|
| within an hour | 1620 | 0.286 |
| within a few hours | 565 | 0.100 |
| within a day | 434 | 0.077 |
| a few days or more | 220 | 0.039 |
| NA | 2831 | 0.499 |
#plot the distribution of host response time
ggplot(res_time,aes(x=host_response_time,y=num))+
geom_col(fill="pink")+
labs(title="Distribution of host response time",
x="Host response time",
y="Numbers")+
theme_bw()+
NULLAfter that, we checked the relationship between host years and host response time to see if those who response quick would have longer host experience on Airbnb. Yet the result didn’t reflect this relationship.
#reorder the host response time in year host experience
year_host_experience<-year_host_experience%>%
mutate(host_response_time=factor(host_response_time,order=TRUE,levels=c("within an hour", "within a few hours", "within a day","a few days or more")))
#check the relationship between response time and host years
ggplot(year_host_experience,aes(x=host_response_time, y=year_host)) +
geom_boxplot(fill="pink",color="red") +
labs(title="Quick response doesn't mean longer years of host experience",
x="Host response time",
y="Years of host")+
theme_bw()+
NULLThen we take a look at another factor that could reflected the quality of hosts–whether the host is a super host–and check whether there’s any correlation between super hosts and others. We found that 20.8% of the hosts are super hosts on Airbnb in Brussels. Still we cannot say that super hosts would response faster than other hosts.
#summary of super hosts
super_hosts<- listings%>%
group_by(host_is_superhost,host_response_time)%>%
summarise(num=n())%>%
mutate(proportion_within_the_group=num/sum(num))
super_hosts%>%
kbl() %>%
kable_styling()| host_is_superhost | host_response_time | num | proportion_within_the_group |
|---|---|---|---|
| FALSE | within an hour | 1162 | 0.259 |
| FALSE | within a few hours | 391 | 0.087 |
| FALSE | within a day | 294 | 0.066 |
| FALSE | a few days or more | 189 | 0.042 |
| FALSE | NA | 2452 | 0.546 |
| TRUE | within an hour | 458 | 0.387 |
| TRUE | within a few hours | 174 | 0.147 |
| TRUE | within a day | 140 | 0.118 |
| TRUE | a few days or more | 31 | 0.026 |
| TRUE | NA | 379 | 0.321 |
#plot the distribution of host response time
ggplot(super_hosts,aes(x=reorder(host_is_superhost,num),y=num))+
geom_col(fill="pink")+
labs(title="There are 1182 super hosts on Airbnb in Brussels",
x="Host is a super host",
y="Numbers")+
theme_bw()+
NULL#Plot to see the relationship between super hosts and other hosts
ggplot(listings,aes(x=host_response_time, y=count(host_response_time))) +
geom_col(fill="pink") +
facet_wrap(~host_is_superhost,ncol=1)+
labs(title="Super hosts didn't seem to respond more quickly than other hosts",
x="Host response time",
y="Numbers")+
theme_bw()+
NULLNext we take a look at the locations of property. According to the data, there are 19 neighborhoods covered by the listings on Airbnb. And 1891 listings are in Bruxelles, which is 33% of the total listings in Brussels. The top 3 neighborhoods are Bruxelles, Ixelles and Saint-Gilles.
#summary of the neighborhood covered by Airbnb listings in Brussels
su_locations<-listings%>%
group_by(neighbourhood_cleansed)%>%
summarise(num=n())%>%
mutate(proportion=num/sum(num))%>%
arrange(desc(num))
su_locations%>%
kbl() %>%
kable_styling()| neighbourhood_cleansed | num | proportion |
|---|---|---|
| Bruxelles | 1891 | 0.334 |
| Ixelles | 933 | 0.165 |
| Saint-Gilles | 622 | 0.110 |
| Schaerbeek | 372 | 0.066 |
| Etterbeek | 280 | 0.049 |
| Forest | 268 | 0.047 |
| Anderlecht | 259 | 0.046 |
| Uccle | 248 | 0.044 |
| Molenbeek-Saint-Jean | 149 | 0.026 |
| Saint-Josse-ten-Noode | 121 | 0.021 |
| Woluwe-Saint-Lambert | 120 | 0.021 |
| Auderghem | 86 | 0.015 |
| Woluwe-Saint-Pierre | 75 | 0.013 |
| Jette | 73 | 0.013 |
| Watermael-Boitsfort | 56 | 0.010 |
| Evere | 39 | 0.007 |
| Koekelberg | 32 | 0.006 |
| Berchem-Sainte-Agathe | 25 | 0.004 |
| Ganshoren | 21 | 0.004 |
#plot the number of listings in different neighborhoods
ggplot(su_locations,aes(x=num,y=reorder(neighbourhood_cleansed,num)))+
geom_col(fill="pink")+
labs(title="Airbnb listings in different neighbourhoods in Brussels",
x="Number of listings",
y=NULL)+
theme_bw()+
NULLAnd then we take a look at the accuracy of the locations shown on Airbnb. Almost 85% of the listings show the exact same location as it is in Brussels, which suggests that the level of accuracy is quite high.
#summary of the location accuracy
location_accu<-listings%>%
group_by(is_location_exact)%>%
summarise(num=n())%>%
mutate(proportion=num/sum(num))%>%
arrange(desc(num))
location_accu%>%
kbl() %>%
kable_styling()| is_location_exact | num | proportion |
|---|---|---|
| TRUE | 4811 | 0.849 |
| FALSE | 859 | 0.151 |
#plot the accuracy of locations
ggplot(location_accu,aes(x=reorder(is_location_exact,-num),y=num))+
geom_col(fill="pink")+
labs(title="The accuracy of the location shown on Airbnb listings in Brussels",
x="Is the location exact the same as shown?",
y="Number of listings")+
theme_bw()+
NULLIn this part, we look at the property type, room type and other features.
# summary of simplified property types after filters
category_5_p2<-listings%>%
count(prop_type_simplified)%>%
mutate(prop_type_simplified=factor(prop_type_simplified,order=TRUE,levels = c("Apartment","House", "Condominium","Townhouse","Other")),
proportion=n/sum(n))
#plot the number of listings in different property types
ggplot(category_5_p2,aes(x=prop_type_simplified,y=n))+
geom_col(fill="Pink")+
labs(title="The majority of Airbnb listings in Brussels are apartments",
x="Simplified proper types",
y="Number of listings")+
theme_bw()+
NULLThen we take a look at the room type. 69% of the listings are entire home/apt and 29% of the listings are private rooms. There are only a few hotel rooms and shared rooms, which shows a high level of overall privacy. Besides
#summary of room types
r_type1<-listings%>%
group_by(room_type)%>%
summarise(num=n())%>%
mutate(proportion=num/sum(num))%>%
arrange(desc(num))
r_type1%>%
kbl() %>%
kable_styling()| room_type | num | proportion |
|---|---|---|
| Entire home/apt | 3914 | 0.690 |
| Private room | 1667 | 0.294 |
| Hotel room | 59 | 0.010 |
| Shared room | 30 | 0.005 |
#plot the distribution of rooms types
ggplot(r_type1,aes(x=num, y=reorder(room_type,num))) +
geom_col(fill="pink") +
labs(title="Airbnb listings in Brussels has a high level of privacy",
subtitle = "distribution of different room types",
x="Number of listings",
y=NULL)+
theme_bw()+
NULLAlso we could see the distribution of bedrooms and bathrooms among the Airbnb listings in Brussels. The most common value for both bedrooms and bathrooms is 1.
#bedrooms
#statistical summary of bedrooms
bed_r<-favstats(~bedrooms, data=listings)
bed_r%>%
kbl() %>%
kable_styling()| min | Q1 | median | Q3 | max | mean | sd | n | missing | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 10 | 1.22 | 0.808 | 5670 | 0 |
#plot the distribution of bedrooms
ggplot(listings,aes(x=bedrooms))+
geom_density(fill="pink",color="red")+
labs(title="The majority of the listings have one bedroom",
x="Number of bedrooms",
y="Density")+
scale_x_continuous(breaks = seq(0,10,1), limits = c(0,10))+
theme_bw()+
NULL#bathrooms
#statistical summary of bathrooms
bath_r<-favstats(~bathrooms, data=listings)
bath_r%>%
kbl() %>%
kable_styling()| min | Q1 | median | Q3 | max | mean | sd | n | missing | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 11 | 1.17 | 0.461 | 5670 | 0 |
#plot the distribution of bathrooms
ggplot(listings,aes(x=bathrooms))+
geom_density(fill="pink",color="red")+
labs(title="The majority of the listings have one bathroom",
x="Number of bathrooms",
y="Density")+
scale_x_continuous(breaks = seq(0,10,1), limits = c(0,10))+
theme_bw()+
NULLThe average accommodates is 3.25 persons, and over 50% of listings accommodates for exact 2 people. Besides, those could accommodate even number of people are more than those accommodates odd number of people.
#accommodates
#statistical summary of accommodates
acco_n<-favstats(~accommodates, data=listings)
acco_n%>%
kbl() %>%
kable_styling()| min | Q1 | median | Q3 | max | mean | sd | n | missing | |
|---|---|---|---|---|---|---|---|---|---|
| 2 | 2 | 2 | 4 | 16 | 3.25 | 1.76 | 5670 | 0 |
#plot the distribution of accommodates
ggplot(listings,aes(x=accommodates))+
geom_density(fill="pink",color="red")+
labs(title="Over 50% of listings accommodates 2 people",
x="Accommodates",
y="Density")+
scale_x_continuous(breaks = seq(0,16,2), limits = c(0,16))+
theme_bw()+
NULLOverall, the average price is 76.9 dollars and the median price is 60 dollars. But the maximum price reached 3500 dollars, which led to a positive skewed price distribution.
#price
#statistical summary of price
summary_price<-favstats(~price, data=listings)
summary_price%>%
kbl() %>%
kable_styling()| min | Q1 | median | Q3 | max | mean | sd | n | missing | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 45 | 60 | 89 | 3500 | 76.9 | 78.9 | 5670 | 0 |
#plot the distribution of price
ggplot(listings,aes(x=price))+
geom_density(fill="pink",color="red")+
labs(title="Distribution of price is skewed right",
x="Price",
y="Density")+
theme_bw()+
scale_x_continuous(breaks = seq(0,1000,50), limits = c(0,1000))+
NULLThe average number of reviews is 42, while the median is only 14, which suggests a right-skewed distribution. Meanwhile, the maximum number of reviews is 766, which is far beyond the average.
#statistical summary of number_of_reviews
su_review_numbers<-favstats(~number_of_reviews, data=listings)
su_review_numbers%>%
kbl() %>%
kable_styling()| min | Q1 | median | Q3 | max | mean | sd | n | missing | |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 5 | 14 | 45 | 766 | 42 | 71.3 | 5670 | 0 |
#plot the distribution of number_of_reviews
ggplot(listings,aes(x=number_of_reviews))+
geom_density(fill="pink",color="red")+
labs(title="Distribution of number of reviews is skewed right",
x="numbers of reviews",
y="Density")+
theme_bw()+
scale_x_continuous(breaks = seq(0,800,100), limits = c(0,800))+
NULLThe average rating score is 92.7 and the median rating score is 95, which suggests high satisfaction from customers. The distribution of rating scores is skewed left.
#statistical summary of reviews_scores_rating
su_review_rating<-favstats(~review_scores_rating, data=listings)
su_review_rating%>%
kbl() %>%
kable_styling()| min | Q1 | median | Q3 | max | mean | sd | n | missing | |
|---|---|---|---|---|---|---|---|---|---|
| 20 | 90 | 95 | 98 | 100 | 92.7 | 8.45 | 5670 | 0 |
#plot the distribution of reviews_scores_rating
ggplot(listings,aes(x=review_scores_rating))+
geom_density(fill="pink",color="red")+
labs(title="Distribution of rating scores is skewed left",
x="Rating scores",
y="Density")+
theme_bw()+
scale_x_continuous(breaks = seq(0,100,25), limits = c(0,100))+
NULLTo break down the rating scores, we also take a look at the scores on different aspects.
#plot the distribution of different rating scores
# plot the distribution of scores on accuracy
p_accuracy <- listings%>%
ggplot(aes(x=review_scores_accuracy)) +
geom_density(fill="pink",color="red") +
labs(x = "review score") +
ggtitle("Accuracy") +
theme_bw()+
NULL
# plot the distribution of scores on checkin process
p_checkin <- listings%>%
ggplot(aes(x=review_scores_checkin)) +
geom_density(fill="pink",color="red") +
labs(x = "review score") +
ggtitle(" Checkin") +
theme_bw()+
NULL
# plot the distribution of scores on cleanliness
p_cleanliness <- listings%>%
ggplot(aes(x=review_scores_cleanliness)) +
geom_density(fill="pink",color="red") +
labs(x = "review score") +
ggtitle("Cleanliness") +
theme_bw()+
NULL
# plot the distribution of scores on communication
p_communication <- listings%>%
ggplot(aes(x=review_scores_communication)) +
geom_density(fill="pink",color="red") +
labs(x = "review score") +
ggtitle("Communication") +
theme_bw()+
NULL
# plot the distribution of scores on location
p_location <- listings%>%
ggplot(aes(x=review_scores_location)) +
geom_density(fill="pink",color="red") +
labs(x = "review score") +
ggtitle("Location") +
theme_bw()+
NULL
# plot the distribution of scores on value
p_value <- listings%>%
ggplot(aes(x=review_scores_value)) +
geom_density(fill="pink",color="red") +
labs(x = "review score") +
ggtitle("Value") +
theme_bw()+
NULL
#put the plots together
library(patchwork)
combine1<- p_accuracy+p_cleanliness+p_checkin+p_communication+p_location+p_value
combine1Basically, the cancellation policy in Airbnb listings in Brussels is proportioned evenly. About 37% of the listings have a flexible cancellation policy, and around 31% of the listings have a moderate cancellation policy and the rest have a strict policy.
#summary of the cancellation policies
su_cancel<-listings%>%
group_by(cancellation_policy)%>%
summarise(num=n())%>%
mutate(proportion=num/sum(num),
cancellation_policy=factor(cancellation_policy,ordered = TRUE,levels = c("flexible","moderate","strict_14_with_grace_period","super_strict_30")))
su_cancel%>%
kbl() %>%
kable_styling()| cancellation_policy | num | proportion |
|---|---|---|
| flexible | 2136 | 0.377 |
| moderate | 1750 | 0.309 |
| strict_14_with_grace_period | 1781 | 0.314 |
| super_strict_30 | 3 | 0.001 |
#plot the number of listings in different cancellation policies
ggplot(su_cancel,aes(x=cancellation_policy,y=num))+
geom_col(fill="pink")+
labs(title="Airbnb listings with different cancellation policies in Brussels",
x="Cancellation policy",
y="Number of listings")+
theme_bw()+
NULLwe create the price_4_nights variable using the case_when() function. Our conditional statement says that if the number of guests included in the price is greater than or equal to 2, the price is just the cleaning fee plus 4* the nightly rate, since the 2 people will stay for 4 nights. However, if the number of guests included in the price is less than 2, meaning that a party of 2 would have to pay for exactly 1 extra guest, then the price will be 4* the nightly price, which is the base price plus the price for that extra guest, plus the cleaning fee.
The average cost of 2 people staying for 4 nights is 339 dollars and the overall distribution is right-skewed.
#compute the price_4_nights variable
year_host_experience <- year_host_experience %>%
mutate(price_4_nights = case_when(
guests_included >= 2 ~ cleaning_fee + 4*price,
TRUE ~ 4 * (price + extra_people) +cleaning_fee))
#see the summary of price_4_nights
kbl(skim(year_host_experience$price_4_nights))%>%
kable_styling()| skim_type | skim_variable | n_missing | complete_rate | numeric.mean | numeric.sd | numeric.p0 | numeric.p25 | numeric.p50 | numeric.p75 | numeric.p100 | numeric.hist |
|---|---|---|---|---|---|---|---|---|---|---|---|
| numeric | data | 0 | 1 | 339 | 322 | 36 | 200 | 280 | 380 | 14000 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
ggplot(year_host_experience,aes(x=price_4_nights))+
geom_density(fill="pink",color="red")+
labs(title="Distribution of price for 2 persons staying 4 nights",
x="Price for 4 nights",
y="Density")+
theme_bw()+
scale_x_continuous(breaks = seq(0,1000,500), limits = c(0,1000))+
NULLTo take a look at how the numerical variables are correlated, we use the “ggpair()” function and get the following graphs.
correlation1<- year_host_experience%>%
#reviews
select(review_scores_rating,
review_scores_accuracy,
review_scores_cleanliness,
review_scores_checkin,
review_scores_communication,
review_scores_location,
review_scores_value
) %>%
ggpairs(alpha=0.5) +
theme_bw()
correlation1The review scores are highly correlated to each other with correlation coefficient over 0.4, so we should avoid using them all, and maybe just use one variable as the representative of the reviews.
correlation2<- year_host_experience%>%
#property features and price
select(accommodates,
bathrooms,
bedrooms,
security_deposit,
cleaning_fee,
guests_included,
extra_people,
minimum_nights,
maximum_nights,
price_4_nights) %>%
ggpairs(alpha=0.5) +
theme_bw()
correlation2Based on the above correlation matrix, the number of bedrooms,bathrooms and accommodates are highly correlated with correlation factors over 0.4. Therefore, we would avoid using all three of them in our regression model. Also, we found that the cleaning fee and security deposit are highly correlated to each other, and the correlation is almost linear according to the scatter plot above. Other variables didn’t show obvious linear correlations, therefore we’d like to group some of the variables and see if there’s any correlations between the factors. We started from the potential independent variables.
First, we would like to start from the reviews and explore the relationship between number of reviews and score ratings. We found that the median rating score goes down as the quartile group goes up. However, the average score in bucket 4 is higher than that in bucket 1. After testing the difference of average rating score in bucket 1 and bucket 4, we found that the difference in mean is statistically significant at 95% level. Besides, because the number is highly right-skewed, the sample size is much bigger in quartile group 4 and the result has smaller standard deviation as well. But there’s no linear correlation between these 2 variables.
#Relationship between number of reviews and the score ratings
##select the data
reviewData <- year_host_experience %>%
select(review_scores_rating,number_of_reviews)
glimpse(reviewData)## Rows: 5,670
## Columns: 2
## $ review_scores_rating <dbl> 87, 80, 100, 98, 90, 93, 97, 87, 91, 92, 93, 1...
## $ number_of_reviews <dbl> 15, 2, 3, 12, 4, 59, 6, 48, 25, 33, 3, 1, 45, ...
| skim_type | skim_variable | n_missing | complete_rate | numeric.mean | numeric.sd | numeric.p0 | numeric.p25 | numeric.p50 | numeric.p75 | numeric.p100 | numeric.hist |
|---|---|---|---|---|---|---|---|---|---|---|---|
| numeric | review_scores_rating | 0 | 1 | 92.7 | 8.45 | 20 | 90 | 95 | 98 | 100 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| numeric | number_of_reviews | 0 | 1 | 42.0 | 71.31 | 1 | 5 | 14 | 45 | 766 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
##We group the number of reviews by its quartile
reviewData <- reviewData %>%
mutate(buckets = ntile(number_of_reviews,4))
#check the quartile of number of reviews
reviewData2 <- reviewData %>%
group_by(buckets)%>%
summarize(minimum=min(number_of_reviews),maximum=max(number_of_reviews))
reviewData2%>%
kbl()%>%
kable_styling()| buckets | minimum | maximum |
|---|---|---|
| 1 | 1 | 5 |
| 2 | 5 | 14 |
| 3 | 14 | 45 |
| 4 | 45 | 766 |
#change buckets into factors
reviewData$buckets<-as.factor(reviewData$buckets)
#plot rating scores in different groups
ggplot(reviewData,aes(x=review_scores_rating,y=buckets)) +
geom_boxplot(fill="pink",color="red")+
labs(title="Rating scores are more stable in those with higher number of reviews",
x="Rating scores",
y="Quartiles group of review numbers")+
theme_bw()+
NULL# test the difference in bucket 1 and bucket 4
redata<-reviewData%>%
filter(buckets!="2",
buckets!="3")
t.test(review_scores_rating ~ buckets, data=redata)##
## Welch Two Sample t-test
##
## data: review_scores_rating by buckets
## t = -4, df = 1743, p-value = 3e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.298 -0.831
## sample estimates:
## mean in group 1 mean in group 4
## 91.7 93.2
Then we would like to explore the relationship between hosts’ experience and the review scores. Also we tried to form 4 quartile groups based on the host’s experience. The average score of bucket 4 is slightly higher than that of bucket 1. However, there isn’t any linear correlation between these factors as well.
#Relationship between score ratings and year of hosts' experience
##select the data
rating_host_y <- year_host_experience %>%
select(review_scores_rating,year_host)
glimpse(rating_host_y)## Rows: 5,670
## Columns: 2
## $ review_scores_rating <dbl> 87, 80, 100, 98, 90, 93, 97, 87, 91, 92, 93, 1...
## $ year_host <dbl> 11.82, 11.82, 9.86, 9.39, 9.28, 9.28, 9.28, 9....
| skim_type | skim_variable | n_missing | complete_rate | numeric.mean | numeric.sd | numeric.p0 | numeric.p25 | numeric.p50 | numeric.p75 | numeric.p100 | numeric.hist |
|---|---|---|---|---|---|---|---|---|---|---|---|
| numeric | review_scores_rating | 0 | 1 | 92.72 | 8.45 | 20.000 | 90.00 | 95.00 | 98.00 | 100.0 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| numeric | year_host | 0 | 1 | 4.59 | 2.19 | 0.068 | 3.04 | 4.81 | 6.06 | 11.8 | <U+2585><U+2586><U+2587><U+2582><U+2581> |
##We group the year of hosts by its quartile
rating_host_y <- rating_host_y %>%
mutate(buckets = ntile(year_host,4))
#check the quartile of years of hosts
rating_host_y2 <- rating_host_y %>%
group_by(buckets)%>%
summarize(minimum=min(year_host),maximum=max(year_host))
rating_host_y2%>%
kbl()%>%
kable_styling()| buckets | minimum | maximum |
|---|---|---|
| 1 | 0.068 | 3.04 |
| 2 | 3.038 | 4.81 |
| 3 | 4.811 | 6.06 |
| 4 | 6.060 | 11.82 |
#change buckets into factors
rating_host_y$buckets<-as.factor(rating_host_y$buckets)
#plot rating scores in different groups
ggplot(rating_host_y,aes(x=review_scores_rating,y=buckets)) +
geom_boxplot(fill="pink",color="red")+
labs(title="Rating scores didn't vary a lot between quartile group of host years",
x="Rating scores",
y="Quartiles group of host years")+
theme_bw()+
NULL# test the difference in bucket 1 and bucket 4
rating_host<-rating_host_y%>%
filter(buckets!="2",
buckets!="3")
t.test(review_scores_rating ~ buckets, data=rating_host)##
## Welch Two Sample t-test
##
## data: review_scores_rating by buckets
## t = -6, df = 2314, p-value = 4e-10
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.60 -1.36
## sample estimates:
## mean in group 1 mean in group 4
## 91.7 93.7
Then we check the relationship between neighborhood and the reviews. No obvious linear correlations are found, yet the standard deviation of the number of reviews and the rating score is slightly different between different neighborhood.
#neighborhood vs. number of reviews
ggplot(listings,aes(x=number_of_reviews,y=reorder(neighbourhood_cleansed,number_of_reviews,median)))+
geom_boxplot(fill="pink",color="red")+
labs(title="Number of reviews on listings in different neighbourhoods in Brussels",
x="Number of reviews",
y=NULL)+
theme_bw()+
NULL#neighborhood vs. rating scores
ggplot(listings,aes(x=review_scores_rating, y=reorder(neighbourhood_cleansed,review_scores_rating),median))+
geom_boxplot(fill="pink",color="red")+
labs(title="Rating scores on listings in different neighbourhoods in Brussels",
x="Rating scores",
y=NULL)+
theme_bw()+
NULLThen we check the relationship between simplified property types and the reviews.
#simplified property type vs. number of reviews
ggplot(listings,aes(x=number_of_reviews, y=reorder(prop_type_simplified,number_of_reviews)))+
geom_boxplot(fill="pink",color="red")+
labs(title="Number of reviews didn't vary much across different property types",
x="Number of reviews",
y=NULL)+
theme_bw()+
NULL#simplified property type vs. rating scores
ggplot(listings,aes(x=review_scores_rating, y=reorder(prop_type_simplified,review_scores_rating)))+
geom_boxplot(fill="pink",color="red")+
labs(title="Rating scores didn't vary much across different property types",
x="Rating scores",
y=NULL)+
theme_bw()+
NULLWhat’s more, we examine the relationship between simplified property types and the neighborhood. Apartments are the most common property type in almost every neighborhoods in Brussels.
prop_type1<-listings%>%
group_by(neighbourhood_cleansed,prop_type_simplified)%>%
summarise(num=n())%>%
mutate(proportion=num/sum(num))%>%
arrange(desc(num))
prop_type1%>%
kbl() %>%
kable_styling()| neighbourhood_cleansed | prop_type_simplified | num | proportion |
|---|---|---|---|
| Bruxelles | Apartment | 1375 | 0.727 |
| Ixelles | Apartment | 758 | 0.812 |
| Saint-Gilles | Apartment | 483 | 0.777 |
| Bruxelles | Other | 283 | 0.150 |
| Schaerbeek | Apartment | 254 | 0.683 |
| Forest | Apartment | 210 | 0.784 |
| Etterbeek | Apartment | 209 | 0.746 |
| Anderlecht | Apartment | 165 | 0.637 |
| Uccle | Apartment | 140 | 0.565 |
| Bruxelles | Condominium | 98 | 0.052 |
| Saint-Josse-ten-Noode | Apartment | 98 | 0.810 |
| Molenbeek-Saint-Jean | Apartment | 97 | 0.651 |
| Woluwe-Saint-Lambert | Apartment | 92 | 0.767 |
| Bruxelles | House | 77 | 0.041 |
| Uccle | House | 62 | 0.250 |
| Ixelles | House | 61 | 0.065 |
| Anderlecht | Other | 60 | 0.232 |
| Bruxelles | Townhouse | 58 | 0.031 |
| Ixelles | Other | 54 | 0.058 |
| Jette | Apartment | 50 | 0.685 |
| Auderghem | Apartment | 45 | 0.523 |
| Saint-Gilles | House | 45 | 0.072 |
| Saint-Gilles | Other | 45 | 0.072 |
| Woluwe-Saint-Pierre | Apartment | 43 | 0.573 |
| Schaerbeek | Other | 41 | 0.110 |
| Schaerbeek | House | 37 | 0.099 |
| Saint-Gilles | Townhouse | 34 | 0.055 |
| Ixelles | Townhouse | 32 | 0.034 |
| Molenbeek-Saint-Jean | Other | 29 | 0.195 |
| Ixelles | Condominium | 28 | 0.030 |
| Etterbeek | House | 27 | 0.096 |
| Evere | Apartment | 25 | 0.641 |
| Koekelberg | Apartment | 25 | 0.781 |
| Forest | House | 23 | 0.086 |
| Auderghem | House | 22 | 0.256 |
| Schaerbeek | Townhouse | 22 | 0.059 |
| Etterbeek | Other | 21 | 0.075 |
| Watermael-Boitsfort | Apartment | 21 | 0.375 |
| Forest | Other | 20 | 0.075 |
| Schaerbeek | Condominium | 18 | 0.048 |
| Watermael-Boitsfort | House | 18 | 0.321 |
| Anderlecht | Condominium | 17 | 0.066 |
| Anderlecht | House | 16 | 0.062 |
| Uccle | Condominium | 16 | 0.065 |
| Uccle | Townhouse | 16 | 0.065 |
| Woluwe-Saint-Pierre | House | 16 | 0.213 |
| Saint-Gilles | Condominium | 15 | 0.024 |
| Ganshoren | Apartment | 14 | 0.667 |
| Uccle | Other | 14 | 0.056 |
| Etterbeek | Townhouse | 13 | 0.046 |
| Woluwe-Saint-Lambert | House | 13 | 0.108 |
| Berchem-Sainte-Agathe | Apartment | 12 | 0.480 |
| Jette | House | 12 | 0.164 |
| Watermael-Boitsfort | Other | 12 | 0.214 |
| Molenbeek-Saint-Jean | Condominium | 11 | 0.074 |
| Etterbeek | Condominium | 10 | 0.036 |
| Forest | Townhouse | 9 | 0.034 |
| Saint-Josse-ten-Noode | House | 9 | 0.074 |
| Auderghem | Other | 8 | 0.093 |
| Auderghem | Townhouse | 8 | 0.093 |
| Saint-Josse-ten-Noode | Other | 8 | 0.066 |
| Molenbeek-Saint-Jean | House | 7 | 0.047 |
| Woluwe-Saint-Pierre | Condominium | 7 | 0.093 |
| Evere | Condominium | 6 | 0.154 |
| Forest | Condominium | 6 | 0.022 |
| Woluwe-Saint-Lambert | Townhouse | 6 | 0.050 |
| Woluwe-Saint-Pierre | Other | 6 | 0.080 |
| Berchem-Sainte-Agathe | House | 5 | 0.200 |
| Jette | Condominium | 5 | 0.068 |
| Molenbeek-Saint-Jean | Townhouse | 5 | 0.034 |
| Woluwe-Saint-Lambert | Other | 5 | 0.042 |
| Berchem-Sainte-Agathe | Other | 4 | 0.160 |
| Evere | House | 4 | 0.103 |
| Evere | Other | 4 | 0.103 |
| Saint-Josse-ten-Noode | Condominium | 4 | 0.033 |
| Watermael-Boitsfort | Townhouse | 4 | 0.071 |
| Woluwe-Saint-Lambert | Condominium | 4 | 0.033 |
| Auderghem | Condominium | 3 | 0.035 |
| Berchem-Sainte-Agathe | Townhouse | 3 | 0.120 |
| Ganshoren | House | 3 | 0.143 |
| Jette | Other | 3 | 0.041 |
| Jette | Townhouse | 3 | 0.041 |
| Koekelberg | Townhouse | 3 | 0.094 |
| Woluwe-Saint-Pierre | Townhouse | 3 | 0.040 |
| Ganshoren | Condominium | 2 | 0.095 |
| Koekelberg | Other | 2 | 0.062 |
| Saint-Josse-ten-Noode | Townhouse | 2 | 0.017 |
| Anderlecht | Townhouse | 1 | 0.004 |
| Berchem-Sainte-Agathe | Condominium | 1 | 0.040 |
| Ganshoren | Other | 1 | 0.048 |
| Ganshoren | Townhouse | 1 | 0.048 |
| Koekelberg | Condominium | 1 | 0.031 |
| Koekelberg | House | 1 | 0.031 |
| Watermael-Boitsfort | Condominium | 1 | 0.018 |
#plot the distribution of rooms types
ggplot(prop_type1,aes(x=num,y=reorder(neighbourhood_cleansed,num),fill=prop_type_simplified)) +
geom_bar(position="dodge",stat="identity") +
labs(title="Distribution of different property types",
subtitle = "across different neighbourhoods in Brussels",
x="Number of listings",
y=NULL)+
theme_bw()+
theme(legend.title = element_blank())+
NULLSince we have to study how the other variable could impact the total cost of staying, we decided to explore the correlation between price_4_nights and the other potential variables.
First we start with the cost and the property types. Apart from the “Other” category, the average costs to stay in a house is 365 dollars, while that for an apartment is 322 dollars. However, we found that the average price of apartment is not statistically significant than that of houses after testing the difference.
#compute summary of price of different property types
sum_price_property<-year_host_experience%>%
group_by(prop_type_simplified)%>%
summarise(avg_price4=mean(price_4_nights),
median_price4=median(price_4_nights),
sd_p4=sd(price_4_nights),
max_p4=max(price_4_nights),
min_p4=max(price_4_nights))
#show the summary
sum_price_property%>%
kbl()%>%
kable_styling()| prop_type_simplified | avg_price4 | median_price4 | sd_p4 | max_p4 | min_p4 |
|---|---|---|---|---|---|
| Apartment | 322 | 274 | 239 | 4120 | 4120 |
| Condominium | 339 | 303 | 174 | 1200 | 1200 |
| House | 365 | 235 | 703 | 14000 | 14000 |
| Other | 443 | 360 | 394 | 4910 | 4910 |
| Townhouse | 306 | 220 | 255 | 1930 | 1930 |
#plot the box plot based on simplified property type
ggplot(year_host_experience,aes(x=prop_type_simplified,y=price_4_nights))+
geom_boxplot(fill="pink",color="red")+
labs(title="Price for 4 nights in different property types",
x=NULL,
y="Price for 4 nights")+
scale_y_continuous(breaks = c(0,1000,500),limits = c(0,1000))+
theme_bw()+
NULL# test the difference in apartment and house
test_prop_price<-year_host_experience%>%
filter(prop_type_simplified %in% c("Apartment","House"))
t.test(price_4_nights ~ prop_type_simplified, data=test_prop_price)##
## Welch Two Sample t-test
##
## data: price_4_nights by prop_type_simplified
## t = -1, df = 469, p-value = 0.2
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -107.5 22.5
## sample estimates:
## mean in group Apartment mean in group House
## 322 365
We also examined the relationship between price and room types. After testing the difference between average price in each group, we found that hotel rooms are more expensive than the entire home or apartment, and the difference is statistically significant at 95% level. Meanwhile, the average price of private room is higher than that of shared room, and the difference is statistically significant at 90% level.
#compute summary of price of different room types
sum_price_room<-year_host_experience%>%
group_by(room_type)%>%
summarise(avg_price4=mean(price_4_nights),
median_price4=median(price_4_nights),
sd_p4=sd(price_4_nights),
max_p4=max(price_4_nights),
min_p4=max(price_4_nights))
#show the summary
sum_price_room%>%
kbl()%>%
kable_styling()| room_type | avg_price4 | median_price4 | sd_p4 | max_p4 | min_p4 |
|---|---|---|---|---|---|
| Entire home/apt | 380 | 320 | 246 | 4120 | 4120 |
| Hotel room | 662 | 448 | 686 | 4910 | 4910 |
| Private room | 235 | 189 | 419 | 14000 | 14000 |
| Shared room | 197 | 172 | 116 | 667 | 667 |
#plot the box plot based on simplified property type
ggplot(year_host_experience,aes(x=room_type,y=price_4_nights))+
geom_boxplot(fill="pink",color="red")+
labs(title="Price for 4 nights varies in different room types",
x=NULL,
y="Price for 4 nights")+
scale_y_continuous(breaks = c(0,1000,500),limits = c(0,1000))+
theme_bw()+
NULL# test the difference in hotel room and entire home/apt
test_room_price1<-year_host_experience%>%
filter(room_type %in% c("Entire home/apt","Hotel room"))
t.test(price_4_nights ~ room_type, data=test_room_price1)##
## Welch Two Sample t-test
##
## data: price_4_nights by room_type
## t = -3, df = 58, p-value = 0.003
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -461 -103
## sample estimates:
## mean in group Entire home/apt mean in group Hotel room
## 380 662
# test the difference in private room and shared home
test_room_price1<-year_host_experience%>%
filter(room_type %in% c("Private room","Shared room"))
t.test(price_4_nights ~ room_type, data=test_room_price1)##
## Welch Two Sample t-test
##
## data: price_4_nights by room_type
## t = 2, df = 44, p-value = 0.1
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -8.95 86.08
## sample estimates:
## mean in group Private room mean in group Shared room
## 235 197
Then we started with the cost and the reviews by creating scatter plots. We found that the price didn’t goes up as the number of reviews goes up. Maybe people reviews more on the property with not that high price on Airbnb.
On the other hand, we noticed that the maximum price goes up when the ratings goes up, which somehow suggests that people might tend to pay higher price for those with higher rating.
#examine the relationship between price and number of reviews
ggplot(year_host_experience,aes(x=number_of_reviews,y=price_4_nights,color="red"))+
geom_point()+
labs(title="Price for 4 nights change as reviews numbers goes up",
x=NULL,
y="Price for 4 nights")+
scale_y_continuous(breaks = c(0,5000,500),limits = c(0,5000))+
theme_bw()+
NULL#examine the relationship between price and rating scores
ggplot(year_host_experience,aes(x=review_scores_rating,y=price_4_nights,color="red"))+
geom_point(show.legend = FALSE)+
labs(title="Maximum price for 4 nights goes up as rating scores goes up",
x=NULL,
y="Price for 4 nights")+
scale_y_continuous(breaks = c(0,5000,500),limits = c(0,5000))+
theme_bw()+
NULLHowever, the average price didn’t change as the rating scores goes up. We also formed 4 groups based on the quartile of ratings scores and and found the difference between the average costs is not statistically significant.
##We group the rating scores by its quartile
rating_price <- year_host_experience %>%
mutate(buckets = ntile(review_scores_rating,4))
#check the quartile of rating scores
rating_price2 <- rating_price %>%
group_by(buckets)%>%
summarize(minimum=min(review_scores_rating),maximum=max(review_scores_rating))
rating_price2%>%
kbl()%>%
kable_styling()| buckets | minimum | maximum |
|---|---|---|
| 1 | 20 | 90 |
| 2 | 90 | 95 |
| 3 | 95 | 98 |
| 4 | 98 | 100 |
#change buckets into factors
rating_price$buckets<-as.factor(rating_price$buckets)
#plot price for 4 nights in different groups
ggplot(rating_price,aes(x=price_4_nights,y=buckets)) +
geom_boxplot(fill="pink",color="red")+
labs(title="Average price didn't vary a lot between different groups of rating scores",
x="Price for 4 nights",
y="Quartiles group of rating scores")+
scale_x_continuous(breaks = c(0,1000,500),limits=c(0,1000))+
theme_bw()+
NULL# test the difference in bucket 1 and bucket 4
rating_price<-rating_price%>%
filter(buckets!="2",
buckets!="3")
t.test(price_4_nights ~ buckets, data=rating_price)##
## Welch Two Sample t-test
##
## data: price_4_nights by buckets
## t = -1, df = 2821, p-value = 0.3
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -30.75 9.25
## sample estimates:
## mean in group 1 mean in group 4
## 336 347
Next, we would like to see whether the hosting experience is correlated to the total price. According to the following graphs, we found that the price didn’t vary a lot between experienced hosts and new hosts. Besides, listings with a super host may have a higher average price than others, and the difference is statistically significant at 90% level.
#examine the relationship between price and years of host
ggplot(year_host_experience,aes(x=year_host,y=price_4_nights,color="red"))+
geom_point()+
labs(title="Price for 4 nights didn't vary a lot between experienced hosts and new hosts",
x="years of hosting experience",
y="Price for 4 nights")+
scale_y_continuous(breaks = c(0,1000,500),limits = c(0,1000))+
theme_bw()+
NULL#examine the relationship between price and super hosts
ggplot(year_host_experience,aes(x=host_is_superhost,y=price_4_nights))+
geom_boxplot(fill="pink",color="red")+
labs(title="Price for 4 nights goes up as rating scores goes up",
x=NULL,
y="Price for 4 nights")+
scale_y_continuous(breaks = c(0,1000,500),limits = c(0,1000))+
theme_bw()+
NULL#test the significance of the difference
t.test(price_4_nights ~ host_is_superhost, data=year_host_experience)##
## Welch Two Sample t-test
##
## data: price_4_nights by host_is_superhost
## t = -2, df = 2689, p-value = 0.1
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -29.33 3.81
## sample estimates:
## mean in group FALSE mean in group TRUE
## 337 349
After that, we examine the relationship between price and cancellation policy. From the graphs below, we could see that listings with stricter cancellation policies cost slightly more in average than those with more flexible policies. This could be related to how popular the listings are. Strict cancellation policy might suggest that the listing is more popular in the market, thus the price would be a little bit higher.
#examine the relationship between price and cancellation policies
ggplot(year_host_experience,aes(x=cancellation_policy,y=price_4_nights))+
geom_boxplot(fill="pink",color="red")+
labs(title="Price for 4 nights goes up as policy becomes stricter",
x="Cancellation policy",
y="Price for 4 nights")+
scale_y_continuous(breaks = c(0,1000,500),limits = c(0,1000))+
theme_bw()+
NULLFinally, we examine the correlation between price and location. The median of total costs slightly varied among different neiborhood. Bruxelles has the highest median costs and the difference between its median cost and that of Evere is statistically significant at 95% level.
#price_4_nights in different neighborhood
ggplot(year_host_experience,aes(x=price_4_nights,y=reorder(neighbourhood_cleansed,price_4_nights,median)))+
geom_boxplot(fill="pink",color="red")+
labs(title="Distribution of price for 4 nights in different neighbourhoods in Brussels",
x="Price for 4 nights",
y=NULL)+
scale_x_continuous(breaks = c(0,2000,500),limits = c(0,2000))+
theme_bw()+
NULL# test the difference of average price in Bruxelles and Evere
test_neighbor_price<-year_host_experience%>%
filter(neighbourhood_cleansed %in% c("Bruxelles","Evere"))
t.test(price_4_nights ~ neighbourhood_cleansed, data=test_neighbor_price)##
## Welch Two Sample t-test
##
## data: price_4_nights by neighbourhood_cleansed
## t = 7, df = 49, p-value = 2e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 97.5 181.9
## sample estimates:
## mean in group Bruxelles mean in group Evere
## 395 255
#Data used for regression
listings2<- year_host_experience %>%
select(price,
cleaning_fee,
extra_people,
property_type,
room_type,
number_of_reviews,
review_scores_rating,
longitude,
latitude,
neighbourhood,
guests_included,
host_is_superhost,
beds,
cancellation_policy,
is_location_exact,
bedrooms,
bathrooms,
accommodates,
neighbourhood_cleansed,
prop_type_simplified,
price_4_nights)
listings3 <- listings2
#Grouping neighbourhoods
listings3 <- listings3 %>%
mutate(neighbourhood_simplified=case_when(neighbourhood_cleansed=="Bruxelles"|neighbourhood_cleansed=="Etterbeek"|neighbourhood_cleansed=="Ixelles"|neighbourhood_cleansed=="Saint-Gilles"|neighbourhood_cleansed=="Saint-Josse-ten-Noode"|neighbourhood_cleansed=="Schaerbeek"~"Central",neighbourhood_cleansed=="Berchem-Sainte-Agathe"|neighbourhood_cleansed=="Evere"|neighbourhood_cleansed=="Ganshoren"|neighbourhood_cleansed=="Koekelberg"|neighbourhood_cleansed=="Jette"~"North",neighbourhood_cleansed=="Uccle"|neighbourhood_cleansed=="Watermael-Boitsfort"~"South",neighbourhood_cleansed=="Anderlecht"|neighbourhood_cleansed=="Forest"|neighbourhood_cleansed=="Molenbeek-Saint-Jean"~"West",neighbourhood_cleansed=="Auderghem"|neighbourhood_cleansed=="Woluwe-Saint-Lambert"|neighbourhood_cleansed=="Woluwe-Saint-Pierre"~"East", TRUE~neighbourhood_cleansed))
#Adding log variable
listings3 <- listings3 %>%
mutate(log_price_4_nights=log(price_4_nights))plot2 <- listings3 %>%
ggplot(aes(x=price_4_nights))+
geom_histogram(color="blue",fill="tomato2")
plot2plot3 <- listings3 %>%
ggplot(aes(x=log_price_4_nights))+
geom_histogram(color="blue",fill="tomato2")
plot3The first reason why we will use a log dependent variable (log_price_4_nights) is the interpretation. It is more meaningful to see how each explanatory variable affects the dependent variable in terms of percentages than in the absolute terms. Secondly, as can be seen from the plots above, when we transform the price_4_nights in the log form, it more closely follows a normal distribution while the normal form is skewed to the right. So, it might help us reduce heteroscedasticity of residuals and improve the precision of our estimates.
The goal is to determine how a variety of different variables affects price for 4 nights in an Airbnb accommodation in Bruxelles. We start by making a model where the dependent variable is log_price_4_nights, and the explanatory variables are prop_type_simplified, number_of_reviews, and review_scores_rating. When there is a dependent variable in a log form, coefficients are not interpreted in the standard way. Specifically, if we have a coefficient B1, then that means that a one unit increase in the variable X1 increases the dependent variable by (e^B1-1). But for small enough values of B1, e^B1 - 1 is roughly equal to B1. This is a good approximation when -0.1 < B < 0.1. For values outside of this range, we will use e^B1 - 1 to calculate the effect.
In each regression, we will use vif() to check whether there exists multicollinearity among the explanatory variables. Similarly, we will use autoplot() and see how the model satisfies 4 main OLS assumptions.
#Regression model 1
model1 <- lm(log_price_4_nights ~ prop_type_simplified+number_of_reviews+review_scores_rating, data=listings3)
table1 <- get_regression_table(model1)
table1| term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
|---|---|---|---|---|---|---|
| intercept | 5.37 | 0.078 | 69.2 | 0 | 5.21 | 5.52 |
| prop_type_simplifiedCondominium | 0.085 | 0.034 | 2.47 | 0.013 | 0.018 | 0.152 |
| prop_type_simplifiedHouse | -0.075 | 0.026 | -2.88 | 0.004 | -0.126 | -0.024 |
| prop_type_simplifiedOther | 0.285 | 0.023 | 12.5 | 0 | 0.241 | 0.33 |
| prop_type_simplifiedTownhouse | -0.117 | 0.036 | -3.23 | 0.001 | -0.188 | -0.046 |
| number_of_reviews | 0 | 0 | -1.13 | 0.258 | 0 | 0 |
| review_scores_rating | 0.003 | 0.001 | 3.52 | 0 | 0.001 | 0.005 |
## GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.01 4 1
## number_of_reviews 1.00 1 1
## review_scores_rating 1.00 1 1
With the prop type simplified, we have 4 dummy variables where the omitted group is “apartment”, so coefficients need to be interpreted in relation with the apartment group. For instance, on the prop_type_simplifiedHouse the coefficient is -0.075. This means that on average holding other things constant, houses are roughly 7.5% cheaper than the apartments. Similarly, townhouses are roughly 11% cheaper than apartments, and the accommodations belonging to the “other group” are around 33% more expensive. Condominiums are correlated with 8.5% higher prices than in the apartments. In those 4 cases, the coefficients are significant as p-values are very low. Furthermore, the coefficient on review_scores_rating is 0.003, meaning that a unit increase in this variable, holding everything else constant, increases prices by roughly 0.3%. This statistic is significant. Number_of_reviews seems to have no effect on prices as the coefficient is ~0.
There seems to be no collinearity among explanatory variables as VIF values are all significantly lower than 5. In this model, R-sqaured is very low and there is a heteroscedasticity of residuals, which is a violation of one of the main assumptions of linear regressions. Although heteroscedasticity does not lead to bias in the estimates, it might make them less precise and lead to incorrect standard errors. Hence, in the subsequent models we will add new variables. In addition, even though the above-mentioned coefficients seem to have effect on the dependent variable, this does not necessarily imply causation. When variables which are correlated with both some of the explanatory variables and with the dependent variable are not included in the model, coefficients on some variables cause ‘omitted variables bias’. For instance, it might not necessarily be the case that the type of the accommodation being a house automatically leads to lower prices, rather it might be that the houses are generally located in cheaper regions than apartments, or that the people renting houses take only one room whereas people renting apartments usually take the whole place. That is also why in the next models, we control for variables which might cause this bias (confounding variables).
#Regression model2, with room_type
model2 <- lm(log_price_4_nights ~ prop_type_simplified+number_of_reviews+review_scores_rating+room_type, data=listings3)
table2 <- get_regression_table(model2)
table2| term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
|---|---|---|---|---|---|---|
| intercept | 5.49 | 0.068 | 81 | 0 | 5.36 | 5.62 |
| prop_type_simplifiedCondominium | 0.072 | 0.03 | 2.4 | 0.017 | 0.013 | 0.13 |
| prop_type_simplifiedHouse | 0.183 | 0.024 | 7.78 | 0 | 0.137 | 0.229 |
| prop_type_simplifiedOther | 0.317 | 0.021 | 15.3 | 0 | 0.277 | 0.358 |
| prop_type_simplifiedTownhouse | 0.185 | 0.032 | 5.71 | 0 | 0.122 | 0.249 |
| number_of_reviews | 0 | 0 | -2.23 | 0.026 | 0 | 0 |
| review_scores_rating | 0.003 | 0.001 | 4.32 | 0 | 0.002 | 0.005 |
| room_typeHotel room | 0.163 | 0.063 | 2.58 | 0.01 | 0.039 | 0.287 |
| room_typePrivate room | -0.589 | 0.014 | -41.3 | 0 | -0.617 | -0.561 |
| room_typeShared room | -0.7 | 0.085 | -8.26 | 0 | -0.866 | -0.534 |
## GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.24 4 1.03
## number_of_reviews 1.01 1 1.00
## review_scores_rating 1.00 1 1.00
## room_type 1.23 3 1.04
In this regression, we add variable room_type, which can take 4 values: entire house, shared room, private room, and hotel room. WE created 3 dummies for hotel, private and shared rooms. Hence, the omitted group is entire house and the coefficients on these dummies must be interpreted with respect to this omitted group. Firstly, the coefficient on the “hotel room” dummy, is 0.163, meaning that, ceteris paribus, hotel rooms are on average 17.7% more expensive than entire houses for 4 nights. On the private room dummy, we get a coefficient of -0.589, when transformed, tells us that private rooms are roughly 45% cheaper than entire houses, holding everything else constant. The largest gap is with shared rooms, which seem to be around 50% cheaper than the entire houses. All those statistics are significant even at 1% level. It is interesting to note that for instance, the coefficient on prop_type_simplifiedHouse completely changed, and is now positive. It means that it was negatively biased in the first regression, and it was capturing some of the effect effect of the room_type.
#Adding new explanatory variables - bathrooms, bedrooms, and beds
model3 <- lm(log_price_4_nights ~ prop_type_simplified+number_of_reviews+review_scores_rating+room_type+bathrooms+bedrooms+beds, data=listings3)
table3 <- get_regression_table(model3)
table3| term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
|---|---|---|---|---|---|---|
| intercept | 5.02 | 0.063 | 80.2 | 0 | 4.9 | 5.14 |
| prop_type_simplifiedCondominium | 0.067 | 0.027 | 2.52 | 0.012 | 0.015 | 0.12 |
| prop_type_simplifiedHouse | -0.003 | 0.022 | -0.158 | 0.874 | -0.046 | 0.039 |
| prop_type_simplifiedOther | 0.236 | 0.019 | 12.6 | 0 | 0.199 | 0.273 |
| prop_type_simplifiedTownhouse | 0.019 | 0.029 | 0.631 | 0.528 | -0.039 | 0.076 |
| number_of_reviews | 0 | 0 | -2.23 | 0.026 | 0 | 0 |
| review_scores_rating | 0.004 | 0.001 | 5.36 | 0 | 0.002 | 0.005 |
| room_typeHotel room | 0.247 | 0.057 | 4.36 | 0 | 0.136 | 0.358 |
| room_typePrivate room | -0.457 | 0.013 | -34.1 | 0 | -0.483 | -0.43 |
| room_typeShared room | -0.623 | 0.08 | -7.8 | 0 | -0.779 | -0.466 |
| bathrooms | 0.102 | 0.014 | 7.34 | 0 | 0.075 | 0.129 |
| bedrooms | 0.144 | 0.01 | 14.8 | 0 | 0.125 | 0.163 |
| beds | 0.07 | 0.006 | 12.2 | 0 | 0.059 | 0.082 |
## GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.34 4 1.04
## number_of_reviews 1.01 1 1.01
## review_scores_rating 1.01 1 1.00
## room_type 1.36 3 1.05
## bathrooms 1.37 1 1.17
## bedrooms 2.07 1 1.44
## beds 1.98 1 1.41
In this model, WE added variables ‘bathrooms’, ‘bedrooms’, and ‘beds’. The coefficients on all three new variables are very significant (p-values are very low) and they have a significant effect on prices, as expected. For instance, the coefficient on bedrooms is 0.144, which means that a unit increase in bedroom leads to roughly a 15.5% increase in prices. Similarly, the coefficients for beds and bathrooms are 0.07 and 0.1, respectively. Like in the previous model, the coefficients on the house and townhouse dummies changed, and they are now insignificant.
model4 <- lm(log_price_4_nights ~ prop_type_simplified+number_of_reviews+review_scores_rating+room_type+bathrooms+bedrooms+beds+accommodates, data=listings3)
table4 <- get_regression_table(model4)
table4| term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
|---|---|---|---|---|---|---|
| intercept | 4.82 | 0.062 | 78.4 | 0 | 4.7 | 4.94 |
| prop_type_simplifiedCondominium | 0.071 | 0.026 | 2.73 | 0.006 | 0.02 | 0.122 |
| prop_type_simplifiedHouse | 0 | 0.021 | -0.017 | 0.986 | -0.042 | 0.041 |
| prop_type_simplifiedOther | 0.217 | 0.018 | 11.9 | 0 | 0.181 | 0.253 |
| prop_type_simplifiedTownhouse | 0.017 | 0.028 | 0.603 | 0.547 | -0.039 | 0.073 |
| number_of_reviews | 0 | 0 | -2.88 | 0.004 | 0 | 0 |
| review_scores_rating | 0.004 | 0.001 | 6.96 | 0 | 0.003 | 0.006 |
| room_typeHotel room | 0.304 | 0.055 | 5.54 | 0 | 0.196 | 0.412 |
| room_typePrivate room | -0.386 | 0.013 | -28.6 | 0 | -0.412 | -0.359 |
| room_typeShared room | -0.498 | 0.078 | -6.42 | 0 | -0.65 | -0.346 |
| bathrooms | 0.065 | 0.014 | 4.76 | 0 | 0.038 | 0.091 |
| bedrooms | 0.082 | 0.01 | 8.23 | 0 | 0.063 | 0.102 |
| beds | -0.012 | 0.007 | -1.74 | 0.082 | -0.026 | 0.002 |
| accommodates | 0.112 | 0.006 | 19.3 | 0 | 0.1 | 0.123 |
## GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.34 4 1.04
## number_of_reviews 1.01 1 1.01
## review_scores_rating 1.02 1 1.01
## room_type 1.48 3 1.07
## bathrooms 1.40 1 1.18
## bedrooms 2.31 1 1.52
## beds 3.14 1 1.77
## accommodates 3.69 1 1.92
In the model 4, we add variable ‘accommodates’ which signifies the size of the accommodation. The coefficient on the variable ‘accommodates’ is 0.112, meaning that a unit increase increases price by around 11.85%, holding other variables constant. At this point, there is still no high multicollinearity among the variables, as Variance Inflation Factor (VIf) is not higher than 5 for any variable.
model5 <- lm(log_price_4_nights ~ prop_type_simplified+number_of_reviews+review_scores_rating+room_type+bathrooms+bedrooms+beds+accommodates+host_is_superhost+ is_location_exact, data=listings3)
table5 <- get_regression_table(model5)
table5| term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
|---|---|---|---|---|---|---|
| intercept | 4.91 | 0.064 | 77.2 | 0 | 4.78 | 5.03 |
| prop_type_simplifiedCondominium | 0.068 | 0.026 | 2.63 | 0.009 | 0.017 | 0.118 |
| prop_type_simplifiedHouse | -0.002 | 0.021 | -0.091 | 0.927 | -0.043 | 0.039 |
| prop_type_simplifiedOther | 0.218 | 0.018 | 12 | 0 | 0.182 | 0.253 |
| prop_type_simplifiedTownhouse | 0.003 | 0.028 | 0.118 | 0.906 | -0.052 | 0.059 |
| number_of_reviews | 0 | 0 | -4.58 | 0 | 0 | 0 |
| review_scores_rating | 0.003 | 0.001 | 4.7 | 0 | 0.002 | 0.004 |
| room_typeHotel room | 0.295 | 0.055 | 5.42 | 0 | 0.188 | 0.402 |
| room_typePrivate room | -0.383 | 0.013 | -28.6 | 0 | -0.41 | -0.357 |
| room_typeShared room | -0.482 | 0.077 | -6.24 | 0 | -0.633 | -0.33 |
| bathrooms | 0.062 | 0.014 | 4.55 | 0 | 0.035 | 0.088 |
| bedrooms | 0.081 | 0.01 | 8.17 | 0 | 0.062 | 0.1 |
| beds | -0.012 | 0.007 | -1.7 | 0.09 | -0.026 | 0.002 |
| accommodates | 0.113 | 0.006 | 19.7 | 0 | 0.102 | 0.125 |
| host_is_superhostTRUE | 0.11 | 0.014 | 7.98 | 0 | 0.083 | 0.137 |
| is_location_exactTRUE | 0.027 | 0.015 | 1.82 | 0.068 | -0.002 | 0.056 |
## GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.35 4 1.04
## number_of_reviews 1.06 1 1.03
## review_scores_rating 1.09 1 1.04
## room_type 1.48 3 1.07
## bathrooms 1.40 1 1.18
## bedrooms 2.31 1 1.52
## beds 3.14 1 1.77
## accommodates 3.70 1 1.92
## host_is_superhost 1.13 1 1.06
## is_location_exact 1.01 1 1.00
We can also check whether the facts that a host is a superhost and whether the location is exact lead to higher prices when controlling for other variables. The coefficient on the superhost dummy is 0.11 meaning that being a superhost is correlated with roughly 11% higher prices, ceteris paribus. This coefficient is significant at 1 % level. On the other hand, the coefficient on the exact location dummy is positive and equals 0.027, however it is not significant at the 5% level. Furthermore, the VIF remains the highest on ‘accommodates’ but it is still lower than 5.
model6 <- lm(log_price_4_nights ~ prop_type_simplified+number_of_reviews+review_scores_rating+room_type+bathrooms+bedrooms+beds+accommodates+host_is_superhost+ is_location_exact+cancellation_policy, data=listings3)
table6 <- get_regression_table(model6)
table6| term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
|---|---|---|---|---|---|---|
| intercept | 4.88 | 0.063 | 77 | 0 | 4.75 | 5 |
| prop_type_simplifiedCondominium | 0.064 | 0.026 | 2.48 | 0.013 | 0.013 | 0.114 |
| prop_type_simplifiedHouse | -0.005 | 0.021 | -0.219 | 0.826 | -0.045 | 0.036 |
| prop_type_simplifiedOther | 0.217 | 0.018 | 12 | 0 | 0.181 | 0.252 |
| prop_type_simplifiedTownhouse | 0.005 | 0.028 | 0.179 | 0.858 | -0.05 | 0.06 |
| number_of_reviews | 0 | 0 | -5.57 | 0 | -0.001 | 0 |
| review_scores_rating | 0.003 | 0.001 | 4.73 | 0 | 0.002 | 0.004 |
| room_typeHotel room | 0.297 | 0.056 | 5.35 | 0 | 0.188 | 0.406 |
| room_typePrivate room | -0.375 | 0.013 | -28.1 | 0 | -0.402 | -0.349 |
| room_typeShared room | -0.492 | 0.077 | -6.41 | 0 | -0.643 | -0.342 |
| bathrooms | 0.06 | 0.013 | 4.48 | 0 | 0.034 | 0.087 |
| bedrooms | 0.083 | 0.01 | 8.39 | 0 | 0.063 | 0.102 |
| beds | -0.011 | 0.007 | -1.59 | 0.111 | -0.025 | 0.003 |
| accommodates | 0.11 | 0.006 | 19.2 | 0 | 0.099 | 0.122 |
| host_is_superhostTRUE | 0.105 | 0.014 | 7.61 | 0 | 0.078 | 0.132 |
| is_location_exactTRUE | 0.029 | 0.015 | 1.95 | 0.051 | 0 | 0.057 |
| cancellation_policymoderate | 0.029 | 0.013 | 2.25 | 0.024 | 0.004 | 0.054 |
| cancellation_policystrict_14_with_grace_period | 0.105 | 0.013 | 8.1 | 0 | 0.08 | 0.131 |
| cancellation_policysuper_strict_30 | 0.025 | 0.234 | 0.107 | 0.915 | -0.433 | 0.483 |
## GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.36 4 1.04
## number_of_reviews 1.08 1 1.04
## review_scores_rating 1.09 1 1.04
## room_type 1.57 3 1.08
## bathrooms 1.40 1 1.18
## bedrooms 2.31 1 1.52
## beds 3.14 1 1.77
## accommodates 3.71 1 1.93
## host_is_superhost 1.14 1 1.07
## is_location_exact 1.01 1 1.00
## cancellation_policy 1.12 3 1.02
In this model we add cancellation policy in the regression and see whether it correlates with the higher prices. Three dummies are created corresponding to strict, super strict and moderate cancellation policy, and the omitted group is ‘flexible’. Compared to flexible, other types of policies seem to correlate with higher prices, however the estimate on the super strict dummy is not significant due to a very low sample size. Of course, this cannot be interpreted as a causal impact, and actually there might be a reverse relationship, since higher quality and hence more expensive apartments might have stricter policies for instance.
model7 <- lm(log_price_4_nights ~ prop_type_simplified+number_of_reviews+review_scores_rating+room_type+bathrooms+bedrooms+beds+accommodates+host_is_superhost+ is_location_exact+cancellation_policy+neighbourhood_simplified, data=listings3)
table7 <- get_regression_table(model7)
table7| term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
|---|---|---|---|---|---|---|
| intercept | 4.91 | 0.063 | 78 | 0 | 4.79 | 5.03 |
| prop_type_simplifiedCondominium | 0.075 | 0.025 | 2.93 | 0.003 | 0.025 | 0.124 |
| prop_type_simplifiedHouse | 0.008 | 0.021 | 0.405 | 0.685 | -0.033 | 0.049 |
| prop_type_simplifiedOther | 0.223 | 0.018 | 12.4 | 0 | 0.188 | 0.258 |
| prop_type_simplifiedTownhouse | 0.003 | 0.028 | 0.111 | 0.912 | -0.052 | 0.058 |
| number_of_reviews | 0 | 0 | -6.18 | 0 | -0.001 | 0 |
| review_scores_rating | 0.003 | 0.001 | 4.78 | 0 | 0.002 | 0.004 |
| room_typeHotel room | 0.279 | 0.055 | 5.06 | 0 | 0.171 | 0.387 |
| room_typePrivate room | -0.368 | 0.013 | -27.7 | 0 | -0.394 | -0.342 |
| room_typeShared room | -0.494 | 0.076 | -6.49 | 0 | -0.643 | -0.345 |
| bathrooms | 0.055 | 0.013 | 4.15 | 0 | 0.029 | 0.082 |
| bedrooms | 0.087 | 0.01 | 8.84 | 0 | 0.067 | 0.106 |
| beds | -0.007 | 0.007 | -1.05 | 0.296 | -0.021 | 0.006 |
| accommodates | 0.108 | 0.006 | 18.9 | 0 | 0.097 | 0.119 |
| host_is_superhostTRUE | 0.106 | 0.014 | 7.81 | 0 | 0.08 | 0.133 |
| is_location_exactTRUE | 0.02 | 0.015 | 1.38 | 0.169 | -0.009 | 0.049 |
| cancellation_policymoderate | 0.027 | 0.013 | 2.11 | 0.035 | 0.002 | 0.052 |
| cancellation_policystrict_14_with_grace_period | 0.105 | 0.013 | 8.19 | 0 | 0.08 | 0.131 |
| cancellation_policysuper_strict_30 | 0.012 | 0.232 | 0.053 | 0.958 | -0.442 | 0.466 |
| neighbourhood_simplifiedEast | -0.07 | 0.024 | -2.88 | 0.004 | -0.118 | -0.022 |
| neighbourhood_simplifiedNorth | -0.198 | 0.029 | -6.82 | 0 | -0.256 | -0.141 |
| neighbourhood_simplifiedSouth | -0.073 | 0.024 | -3.08 | 0.002 | -0.119 | -0.026 |
| neighbourhood_simplifiedWest | -0.122 | 0.016 | -7.45 | 0 | -0.154 | -0.09 |
## GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.41 4 1.04
## number_of_reviews 1.09 1 1.04
## review_scores_rating 1.09 1 1.04
## room_type 1.58 3 1.08
## bathrooms 1.41 1 1.19
## bedrooms 2.32 1 1.52
## beds 3.15 1 1.78
## accommodates 3.73 1 1.93
## host_is_superhost 1.14 1 1.07
## is_location_exact 1.01 1 1.01
## cancellation_policy 1.12 3 1.02
## neighbourhood_simplified 1.09 4 1.01
Finally, we grouped 19 different neighbourhoods of Bruxelles according to their location in 5 groups: Central, West, North, East, and South. In this model, we will also control for these variables. The omitted group is ‘Central’ and the corresponding coefficients on dummies tell us about the gap between the particular region and the central part. Compared to the central part, other parts of the city correlate with lower prices holding other variables constant and estimates are statistically significant. Omitted variables bias in the previous regressions might be caused by location, as location is correlated with both price and the variables such as number_of_reviews, exact location, rating etc. In addition, we check whether this model satisfies the main LINE assumptions of linear regressions by using autoplot. We can see that there exists a clear linear relationship between price_4_nights and explanatory variables, and that residuals are roughly normally distributed except at the borders (Normal Q-Q plot). In addition, it seems that there is no heteroscedasticity and that variance of the residuals is constant. In addition, there is no multicollinearity. Hence, we will use this model to make a prediction.
#Overall table
huxreg(model1,model2,model3,model4,model5,model6,model7,statistics = c('#observations' = 'nobs',
'R squared' = 'r.squared',
'Adj. R Squared' = 'adj.r.squared',
'Residual SE' = 'sigma'))| (1) | (2) | (3) | (4) | (5) | (6) | (7) | |
|---|---|---|---|---|---|---|---|
| (Intercept) | 5.366 *** | 5.489 *** | 5.022 *** | 4.824 *** | 4.908 *** | 4.875 *** | 4.910 *** |
| (0.078) | (0.068) | (0.063) | (0.062) | (0.064) | (0.063) | (0.063) | |
| prop_type_simplifiedCondominium | 0.085 * | 0.072 * | 0.067 * | 0.071 ** | 0.068 ** | 0.064 * | 0.075 ** |
| (0.034) | (0.030) | (0.027) | (0.026) | (0.026) | (0.026) | (0.025) | |
| prop_type_simplifiedHouse | -0.075 ** | 0.183 *** | -0.003 | -0.000 | -0.002 | -0.005 | 0.008 |
| (0.026) | (0.024) | (0.022) | (0.021) | (0.021) | (0.021) | (0.021) | |
| prop_type_simplifiedOther | 0.285 *** | 0.317 *** | 0.236 *** | 0.217 *** | 0.218 *** | 0.217 *** | 0.223 *** |
| (0.023) | (0.021) | (0.019) | (0.018) | (0.018) | (0.018) | (0.018) | |
| prop_type_simplifiedTownhouse | -0.117 ** | 0.185 *** | 0.019 | 0.017 | 0.003 | 0.005 | 0.003 |
| (0.036) | (0.032) | (0.029) | (0.028) | (0.028) | (0.028) | (0.028) | |
| number_of_reviews | -0.000 | -0.000 * | -0.000 * | -0.000 ** | -0.000 *** | -0.000 *** | -0.000 *** |
| (0.000) | (0.000) | (0.000) | (0.000) | (0.000) | (0.000) | (0.000) | |
| review_scores_rating | 0.003 *** | 0.003 *** | 0.004 *** | 0.004 *** | 0.003 *** | 0.003 *** | 0.003 *** |
| (0.001) | (0.001) | (0.001) | (0.001) | (0.001) | (0.001) | (0.001) | |
| room_typeHotel room | 0.163 ** | 0.247 *** | 0.304 *** | 0.295 *** | 0.297 *** | 0.279 *** | |
| (0.063) | (0.057) | (0.055) | (0.055) | (0.056) | (0.055) | ||
| room_typePrivate room | -0.589 *** | -0.457 *** | -0.386 *** | -0.383 *** | -0.375 *** | -0.368 *** | |
| (0.014) | (0.013) | (0.013) | (0.013) | (0.013) | (0.013) | ||
| room_typeShared room | -0.700 *** | -0.623 *** | -0.498 *** | -0.482 *** | -0.492 *** | -0.494 *** | |
| (0.085) | (0.080) | (0.078) | (0.077) | (0.077) | (0.076) | ||
| bathrooms | 0.102 *** | 0.065 *** | 0.062 *** | 0.060 *** | 0.055 *** | ||
| (0.014) | (0.014) | (0.014) | (0.013) | (0.013) | |||
| bedrooms | 0.144 *** | 0.082 *** | 0.081 *** | 0.083 *** | 0.087 *** | ||
| (0.010) | (0.010) | (0.010) | (0.010) | (0.010) | |||
| beds | 0.070 *** | -0.012 | -0.012 | -0.011 | -0.007 | ||
| (0.006) | (0.007) | (0.007) | (0.007) | (0.007) | |||
| accommodates | 0.112 *** | 0.113 *** | 0.110 *** | 0.108 *** | |||
| (0.006) | (0.006) | (0.006) | (0.006) | ||||
| host_is_superhostTRUE | 0.110 *** | 0.105 *** | 0.106 *** | ||||
| (0.014) | (0.014) | (0.014) | |||||
| is_location_exactTRUE | 0.027 | 0.029 | 0.020 | ||||
| (0.015) | (0.015) | (0.015) | |||||
| cancellation_policymoderate | 0.029 * | 0.027 * | |||||
| (0.013) | (0.013) | ||||||
| cancellation_policystrict_14_with_grace_period | 0.105 *** | 0.105 *** | |||||
| (0.013) | (0.013) | ||||||
| cancellation_policysuper_strict_30 | 0.025 | 0.012 | |||||
| (0.234) | (0.232) | ||||||
| neighbourhood_simplifiedEast | -0.070 ** | ||||||
| (0.024) | |||||||
| neighbourhood_simplifiedNorth | -0.198 *** | ||||||
| (0.029) | |||||||
| neighbourhood_simplifiedSouth | -0.073 ** | ||||||
| (0.024) | |||||||
| neighbourhood_simplifiedWest | -0.122 *** | ||||||
| (0.016) | |||||||
| #observations | 5670 | 5670 | 5645 | 5645 | 5645 | 5645 | 5645 |
| R squared | 0.035 | 0.265 | 0.415 | 0.451 | 0.458 | 0.464 | 0.474 |
| Adj. R Squared | 0.034 | 0.264 | 0.413 | 0.450 | 0.456 | 0.462 | 0.472 |
| Residual SE | 0.528 | 0.461 | 0.411 | 0.398 | 0.396 | 0.394 | 0.390 |
| *** p < 0.001; ** p < 0.01; * p < 0.05. | |||||||
#Prediction
model_data <- data.frame(prop_type_simplified = "Apartment",number_of_reviews = 10,review_scores_rating = 95,room_type="Private room",bathrooms=1,bedrooms=2,beds=2,host_is_superhost=FALSE,is_location_exact = FALSE, cancellation_policy = "moderate",neighbourhood_simplified = "Central",accommodates=2)
predict(model7,newdata=model_data, interval = "prediction")## fit lwr upr
## 1 5.29 4.52 6.05
Using model7, we try to predict the price of accommodation for 2 persons for 4 nights. We included the data shown in the chunk above. From this, we get that the price prediction is $198 with the wide confidence interval ranging from roughly 91 dollars to 425 dollars.